Column Descriptions:

id: (Unique id for each patient)

age: (Age of the patient in years)

origin: (place of study)

sex: (Male/Female)

cp: chest pain type ([typical angina, atypical angina, non-anginal, asymptomatic])

trestbps: resting blood pressure (resting blood pressure (in mm Hg on admission to the hospital))

chol: (serum cholesterol in mg/dl)

fbs: (if fasting blood sugar > 120 mg/dl)

restecg: (resting electrocardiographic results) – Values: [normal, stt abnormality, lv hypertrophy]

thalach: maximum heart rate achieved

exang: exercise-induced angina (True/ False)

oldpeak: ST depression induced by exercise relative to rest

slope: the slope of the peak exercise ST segment

ca: number of major vessels (0-3) colored by fluoroscopy

thal: [normal; fixed defect; reversible defect]

num: the predicted attribute

Data Creators: Hungarian Institute of Cardiology. Budapest: Andras Janosi, M.D. University Hospital, Zurich, Switzerland: William Steinbrunn, M.D. University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D. V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D. Relevant Papers: Detrano, R., Janosi, A., Steinbrunn, W., Pfisterer, M., Schmid, J., Sandhu, S., Guppy, K., Lee, S., & Froelicher, V. (1989). International application of a new probability algorithm for the diagnosis of coronary artery disease. American Journal of Cardiology, 64,304–310. Web Link David W. Aha & Dennis Kibler. “Instance-based prediction of heart-disease presence with the Cleveland database.” Web Link Gennari, J.H., Langley, P, & Fisher, D. (1989). Models of incremental concept formation. Artificial Intelligence, 40, 11–61. Web Link

Citation Request: The authors of the databases have requested that any publications resulting from the use of the data include the names of the principal investigator responsible for the data collection at each institution. They would be:

Hungarian Institute of Cardiology. Budapest: Andras Janosi, M.D. University Hospital, Zurich, Switzerland: William Steinbrunn, M.D. University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D. V.A. Medical Center, Long Beach and Cleveland Clinic Foundation:Robert Detrano, M.D., Ph.D.

Libraries used:

1. Preprocessing (Feature Engineering)

Let’s load the data set

First we will change some variable names in order to facilitate the project’s understanding.

So now our variables will be studied as:

summary(heart)
##        id             age            sex              dataset         
##  Min.   :  1.0   Min.   :28.00   Length:920         Length:920        
##  1st Qu.:230.8   1st Qu.:47.00   Class :character   Class :character  
##  Median :460.5   Median :54.00   Mode  :character   Mode  :character  
##  Mean   :460.5   Mean   :53.51                                        
##  3rd Qu.:690.2   3rd Qu.:60.00                                        
##  Max.   :920.0   Max.   :77.00                                        
##                                                                       
##   chest_pain        rest_blood_pressure      chol       blood_sugar    
##  Length:920         Min.   :  0.0       Min.   :  0.0   Mode :logical  
##  Class :character   1st Qu.:120.0       1st Qu.:175.0   FALSE:692      
##  Mode  :character   Median :130.0       Median :223.0   TRUE :138      
##                     Mean   :132.1       Mean   :199.1   NA's :90       
##                     3rd Qu.:140.0       3rd Qu.:268.0                  
##                     Max.   :200.0       Max.   :603.0                  
##                     NA's   :59          NA's   :30                     
##    restecg          max_heart_rate    exang            oldpeak       
##  Length:920         Min.   : 60.0   Mode :logical   Min.   :-2.6000  
##  Class :character   1st Qu.:120.0   FALSE:528       1st Qu.: 0.0000  
##  Mode  :character   Median :140.0   TRUE :337       Median : 0.5000  
##                     Mean   :137.5   NA's :55        Mean   : 0.8788  
##                     3rd Qu.:157.0                   3rd Qu.: 1.5000  
##                     Max.   :202.0                   Max.   : 6.2000  
##                     NA's   :55                      NA's   :62       
##     slope                 ca             thal               stage       
##  Length:920         Min.   :0.0000   Length:920         Min.   :0.0000  
##  Class :character   1st Qu.:0.0000   Class :character   1st Qu.:0.0000  
##  Mode  :character   Median :0.0000   Mode  :character   Median :1.0000  
##                     Mean   :0.6764                      Mean   :0.9957  
##                     3rd Qu.:1.0000                      3rd Qu.:2.0000  
##                     Max.   :3.0000                      Max.   :4.0000  
##                     NA's   :611

By taking a first look at the summary of the data set and analysing it deeply, we get some first conclusions on where to head our Preprocessing of the data:

These main ideas came from a look at the data set information, summary and understanding of the data. Nevertheless, more feature engineering strategies may appear while visualizing the data.

So from this first conclusion we get two main problems, handling the importance of the variables (which variables to delete); the type and values of the variables (factorise); and outstanding values such as NAs and outliers.

Let’s start taking care of duplicated variables and non-relevant variables:

heart$id[duplicated(heart$id)]
## integer(0)
# We see that there are no duplicated rows (no repeated people)

# Now, let's look at the variables to delete
# heart2 = heart %>% select(- id)            # second assumption
heart = heart %>% select(- id, - dataset)  # main assumption
glimpse(heart)
## Rows: 920
## Columns: 14
## $ age                 <int> 63, 67, 67, 37, 41, 56, 62, 57, 63, 53, 57, 56, 56…
## $ sex                 <chr> "Male", "Male", "Male", "Male", "Female", "Male", …
## $ chest_pain          <chr> "typical angina", "asymptomatic", "asymptomatic", …
## $ rest_blood_pressure <int> 145, 160, 120, 130, 130, 120, 140, 120, 130, 140, …
## $ chol                <int> 233, 286, 229, 250, 204, 236, 268, 354, 254, 203, …
## $ blood_sugar         <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA…
## $ restecg             <chr> "lv hypertrophy", "lv hypertrophy", "lv hypertroph…
## $ max_heart_rate      <int> 150, 108, 129, 187, 172, 178, 160, 163, 147, 155, …
## $ exang               <lgl> FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRU…
## $ oldpeak             <dbl> 2.3, 1.5, 2.6, 3.5, 1.4, 0.8, 3.6, 0.6, 1.4, 3.1, …
## $ slope               <chr> "downsloping", "flat", "flat", "downsloping", "ups…
## $ ca                  <int> 0, 3, 2, 0, 0, 0, 2, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0,…
## $ thal                <chr> "fixed defect", "normal", "reversable defect", "no…
## $ stage               <int> 0, 2, 1, 0, 0, 0, 3, 0, 2, 1, 0, 0, 2, 0, 0, 0, 1,…
# glimpse(heart2)

1.1 Encoding

We will factorise the variables that are characters. As PCA and Factor Analysis are numerical techniques, we will not use strings as labels for the categorical variables. Therefore, for a better comprehension of the Project, we will state here the corresponding of each factor number to its actual string value. Logical features will not be factorised as they are handled as ones or zeroes when applying PCA and Factor Analysis techniques.

glimpse(heart)
## Rows: 920
## Columns: 14
## $ age                 <int> 63, 67, 67, 37, 41, 56, 62, 57, 63, 53, 57, 56, 56…
## $ sex                 <fct> 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
## $ chest_pain          <fct> 1, 0, 0, 3, 2, 2, 0, 0, 0, 0, 0, 2, 3, 2, 3, 3, 2,…
## $ rest_blood_pressure <int> 145, 160, 120, 130, 130, 120, 140, 120, 130, 140, …
## $ chol                <int> 233, 286, 229, 250, 204, 236, 268, 354, 254, 203, …
## $ blood_sugar         <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA…
## $ restecg             <fct> 2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 0, 2, 2, 0, 0, 0, 0,…
## $ max_heart_rate      <int> 150, 108, 129, 187, 172, 178, 160, 163, 147, 155, …
## $ exang               <lgl> FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRU…
## $ oldpeak             <dbl> 2.3, 1.5, 2.6, 3.5, 1.4, 0.8, 3.6, 0.6, 1.4, 3.1, …
## $ slope               <fct> 0, 1, 1, 0, 2, 2, 0, 2, 1, 0, 1, 1, 1, 2, 2, 2, 0,…
## $ ca                  <fct> 0, 2, 1, 0, 0, 0, 3, 0, 2, 1, 0, 0, 2, 0, 0, 0, 1,…
## $ thal                <fct> 1, 0, NA, 0, 0, 0, 0, 0, NA, NA, 1, 0, 1, NA, NA, …
## $ stage               <fct> 0, 2, 1, 0, 0, 0, 3, 0, 2, 1, 0, 0, 2, 0, 0, 0, 1,…

After this, our final working features will be defined as:

  • age -> int; Age of the patient in years.

  • sex -> fct; “Male” or “Female” -> (0, 1)

  • chest_pain -> fct; chest pain type ([“typical angina”, “atypical angina”, “non-anginal”, “asymptomatic”]) -> (0, 1, 2, 3)

    • “typical angina” = 0 -> type of chest pain that goes away when resting.

    • “atypical angina” = 1 -> type of chest pain that does not fit the typical pattern of angina.

    • “non-anginal” = 2 -> type of chest pain unrelated to heart conditions.

    • “asymptomatic” = 3 -> absence of chest pain or discomfort.

  • rest_blood_pressure -> int; resting blood pressure (in mm Hg on admission to the hospital).

    • Between 90 and 120 mm Hg is a normal value, above and below it is not. (This feature will be taken care of in the outlier’s section).
  • chol -> int; serum cholesterol in mg/dl.

    • Below 200 mg/dl is an appropiate value, above it is not. (This feature will be taken care of in the outlier’s section).
  • blood_sugar -> lgl; if fasting blood sugar > 120 mg/dl (True or False).

    • If True -> no possibility of having diabetes.

    • If False -> there are possibilities of having diabetes.

  • restecg -> fct; resting electrocardiographic results [“normal”, “stt abnormality”, “lv hypertrophy”] -> (0, 1, 2)

    • “normal” = 0 -> electrical activity of the heart is within the expected range.

    • “stt abnormality” = 1 -> electrical activity of the heart is showing ST-T abnormalities which can indicate various cardiac conditions.

    • “lv hypertrophy” = 2 -> electrical activity of the heart is showing Left Ventricular Hypertrophy which suggests that the left ventricle of the heart is thicker or larger than normal. This can be a sign of various heart conditions or high blood pressure.

  • max_heart_rate -> int; maximum heart rate achieved.

    • Usually related to the group of age. (This feature will be taken care of in the outlier’s section).
  • exang -> lgl; exercise-induced angina (True or False).

    • If True -> individual experienced some type of discomfort during physical activity.

    • If False -> individual did not experienced any discomfort while exercising.

  • oldpeak -> dbl; ST depression induced by exercise relative to rest.

    • Millimeters of depression of the heart. (This feature will be taken care of in the outlier’s section).
  • slope -> fct; the slope of the peak exercise ST segment (“downsloping”, “flat” and “upsloping”) -> (0, 1, 2)

    • “downsloping” = 0 -> negative slope during peak exercise, can indicate important heart conditions.

    • “flat” = 1 -> not showing significant elevation or depression, non-concerning for the patient.

    • “upsloping” = 2 -> positive slope during peak exercise, can indicate important heart conditions.

  • ca -> fct; number of major vessels (0-3) colored by fluoroscopy -> (0, 1, 2, 3)

  • thal -> fct; [“normal”; “fixed defect” and “reversible defect”] -> (0, 1, 2)

  • stage -> fct; the predicted attribute -> (0,1, 2, 3, 4).

    • 0 -> absence of significant heart conditions.

    • 1 -> early or mild stage of heart conditions, which implies minor heart issues or risk factors.

    • 2 -> moderate level of heart conditions, this stage suggests that the individual’s heart condition is becoming more relevant. Medical intervention may be necessary.

    • 3 -> more advanced or severe heart conditions, which require medical interventions, treatment or surgery..

    • 4 -> advanced heart-disease or conditions that may be life-threatening.

1.2 NA Values

The following step, handle NA values:

barplot(colMeans(is.na(heart)), 
        names.arg = names(colMeans(is.na(heart))), col = "skyblue",
        ylab = "Percentage of NAs", 
        main = "Missing Values in Heart Disease", 
        cex.names = 0.8, las = 2)

# We see that ca and thal variables have more than 60% of missing values, so we delete those
heart = heart %>% select(- ca, - thal)
#summary(heart)

# NAs in rest_blood_preassure; chol; max_heart_rate; and oldpeak seem to be easy to take care of
# However, in slope and restecg they are still more than a 15%
# Let's watch out for observations that have a considerable amount of NAs
amountNAs = rowSums(is.na(heart))
count_by_na = table(amountNAs)
percentage_by_na = prop.table(count_by_na) * 100
for (i in 0:max(amountNAs)) {
  cat("Percentage of rows with", i, "missing values:",
      percentage_by_na[i], "%\n")
}
## Percentage of rows with 0 missing values:  %
## Percentage of rows with 1 missing values: 49.78261 %
## Percentage of rows with 2 missing values: 32.06522 %
## Percentage of rows with 3 missing values: 10.54348 %
## Percentage of rows with 4 missing values: 1.413043 %
## Percentage of rows with 5 missing values: 0.326087 %
## Percentage of rows with 6 missing values: 2.608696 %
## Percentage of rows with 7 missing values: 3.152174 %
percentage_by_na[1] + percentage_by_na[2] + percentage_by_na[3]
##       0 
## 92.3913
# We see that more than 92% of the data is covered by observations that
# have less or equal than 3 missing values. Therefore, it might be a
# good idea to delete those rows with more than 4 NAs.

# However, let's see first if there is a pattern in the NA values
md.pattern(heart, plot = TRUE, rotate.names = TRUE)

##     age sex chest_pain stage chol max_heart_rate exang rest_blood_pressure
## 458   1   1          1     1    1              1     1                   1
## 162   1   1          1     1    1              1     1                   1
## 73    1   1          1     1    1              1     1                   1
## 48    1   1          1     1    1              1     1                   1
## 52    1   1          1     1    1              1     1                   1
## 12    1   1          1     1    1              1     1                   1
## 17    1   1          1     1    1              1     1                   1
## 5     1   1          1     1    1              1     1                   1
## 4     1   1          1     1    1              1     1                   1
## 2     1   1          1     1    1              1     1                   1
## 1     1   1          1     1    1              1     1                   1
## 1     1   1          1     1    1              1     1                   0
## 1     1   1          1     1    1              1     1                   0
## 1     1   1          1     1    1              1     1                   0
## 1     1   1          1     1    1              1     1                   0
## 1     1   1          1     1    1              0     0                   0
## 24    1   1          1     1    1              0     0                   0
## 27    1   1          1     1    1              0     0                   0
## 7     1   1          1     1    0              1     1                   1
## 18    1   1          1     1    0              1     1                   1
## 1     1   1          1     1    0              1     1                   1
## 1     1   1          1     1    0              1     1                   1
## 2     1   1          1     1    0              0     0                   0
## 1     1   1          1     1    0              0     0                   0
##       0   0          0     0   30             55    55                  59
##     oldpeak blood_sugar restecg slope    
## 458       1           1       1     1   0
## 162       1           1       1     0   1
## 73        1           1       0     1   1
## 48        1           1       0     0   2
## 52        1           0       1     1   1
## 12        1           0       1     0   2
## 17        1           0       0     1   2
## 5         1           0       0     0   3
## 4         0           1       0     0   3
## 2         0           0       1     0   3
## 1         0           0       0     0   4
## 1         1           1       1     1   1
## 1         1           1       0     1   2
## 1         1           1       0     0   3
## 1         0           0       1     0   4
## 1         1           1       0     1   4
## 24        0           1       1     0   5
## 27        0           1       0     0   6
## 7         1           1       1     1   1
## 18        1           1       1     0   2
## 1         1           1       0     1   2
## 1         1           1       0     0   3
## 2         0           1       1     0   6
## 1         0           1       0     0   7
##          62          90     181   309 841
# From the plot, we see that the pattern in the NA values is that
# whenever you have more than or exactly 4 missing values
# The NAs will be in the features rest_blood_preassure, max_heart_rate, exang, oldpeak and slope
# So, maybe these variables will affect possible statistical models by making the model noisy

# We will eliminate those rows with more than 4 (included) features
# with missing values
heart = heart[amountNAs <= 4, ]
#summary(heart)

# Despite the fact that we got rid of the majority of the missing values
# without loosing a relevant information, we still have got a
# considerable amount of NAs in the variable restecg. Let's check again
md.pattern(heart, plot = TRUE, rotate.names = TRUE)

##     age sex chest_pain stage max_heart_rate exang rest_blood_pressure oldpeak
## 458   1   1          1     1              1     1                   1       1
## 162   1   1          1     1              1     1                   1       1
## 73    1   1          1     1              1     1                   1       1
## 48    1   1          1     1              1     1                   1       1
## 52    1   1          1     1              1     1                   1       1
## 12    1   1          1     1              1     1                   1       1
## 17    1   1          1     1              1     1                   1       1
## 5     1   1          1     1              1     1                   1       1
## 7     1   1          1     1              1     1                   1       1
## 18    1   1          1     1              1     1                   1       1
## 1     1   1          1     1              1     1                   1       1
## 1     1   1          1     1              1     1                   1       1
## 4     1   1          1     1              1     1                   1       0
## 2     1   1          1     1              1     1                   1       0
## 1     1   1          1     1              1     1                   1       0
## 1     1   1          1     1              1     1                   0       1
## 1     1   1          1     1              1     1                   0       1
## 1     1   1          1     1              1     1                   0       1
## 1     1   1          1     1              1     1                   0       0
## 1     1   1          1     1              0     0                   0       1
##       0   0          0     0              1     1                   5       8
##     chol blood_sugar restecg slope    
## 458    1           1       1     1   0
## 162    1           1       1     0   1
## 73     1           1       0     1   1
## 48     1           1       0     0   2
## 52     1           0       1     1   1
## 12     1           0       1     0   2
## 17     1           0       0     1   2
## 5      1           0       0     0   3
## 7      0           1       1     1   1
## 18     0           1       1     0   2
## 1      0           1       0     1   2
## 1      0           1       0     0   3
## 4      1           1       0     0   3
## 2      1           0       1     0   3
## 1      1           0       0     0   4
## 1      1           1       1     1   1
## 1      1           1       0     1   2
## 1      1           1       0     0   3
## 1      1           0       1     0   4
## 1      1           1       0     1   4
##       27          90     153   255 540
summary(heart) 
##       age        sex     chest_pain rest_blood_pressure      chol      
##  Min.   :28.00   0:673   0:470      Min.   :  0.0       Min.   :  0.0  
##  1st Qu.:46.00   1:193   1: 42      1st Qu.:120.0       1st Qu.:174.2  
##  Median :54.00           2:168      Median :130.0       Median :223.0  
##  Mean   :53.13           3:186      Mean   :132.1       Mean   :199.0  
##  3rd Qu.:60.00                      3rd Qu.:140.0       3rd Qu.:268.0  
##  Max.   :77.00                      Max.   :200.0       Max.   :603.0  
##  blood_sugar     restecg max_heart_rate    exang            oldpeak       
##  Mode :logical   0:663   Min.   : 60.0   Mode :logical   Min.   :-2.6000  
##  FALSE:740       1:  0   1st Qu.:120.0   FALSE:529       1st Qu.: 0.0000  
##  TRUE :126       2:203   Median :140.0   TRUE :337       Median : 0.5000  
##                          Mean   :137.6                   Mean   : 0.8781  
##                          3rd Qu.:157.0                   3rd Qu.: 1.5000  
##                          Max.   :202.0                   Max.   : 6.2000  
##  slope   stage  
##  0: 74   0:392  
##  1:448   1:252  
##  2:344   2:102  
##          3: 94  
##          4: 26  
## 
# Even though we chose this, we may consider eliminating those two 
# 'problematic' variables

1.3 Outliers

Our data set to work on now has no NA values, however, the values themselves may be incorrect or are not coherent. Hence, treating outliers is going to be crucial for our study. From the summary of the data set, we see that our variables of interest will be: age, rest_blood_pressure, chol, max_heart_rate and oldpeak.

# 3-sigma rule
mu = mean(heart$age)
sigma = sd(heart$age)
sum(heart$age < mu - 3 * sigma | heart$age > mu + 3 * sigma)
## [1] 0
# No 'big' outliers for the age

mu = mean(heart$rest_blood_pressure)
sigma = sd(heart$rest_blood_pressure)
sum(heart$rest_blood_pressure < mu - 3 * sigma |
      heart$rest_blood_pressure > mu + 3 * sigma)
## [1] 8
# 8 'big' outliers for the restblood_preassure

mu = mean(heart$chol)
sigma = sd(heart$chol)
sum(heart$chol < mu - 3 * sigma | heart$chol > mu + 3 * sigma)
## [1] 2
# 2 'big' outliers for the chol

mu = mean(heart$max_heart_rate)
sigma = sd(heart$max_heart_rate)
sum(heart$max_heart_rate < mu - 3 * sigma | 
      heart$max_heart_rate > mu + 3 * sigma)
## [1] 0
# No 'big' outliers for the max_heart_rate

mu = mean(heart$oldpeak)
sigma = sd(heart$oldpeak)
sum(heart$oldpeak < mu - 3 * sigma | heart$oldpeak > mu + 3 * sigma)
## [1] 7
# 7 'big' outliers for the oldpeak

# Next approach (rest_blood_pressure, chol, max_heart_rate and oldpeak)
# Inter Quantile Range
QI = quantile(heart$rest_blood_pressure, 0.25)
QS = quantile(heart$rest_blood_pressure, 0.75)
IQR = QS - QI
sum(heart$rest_blood_pressure < QI - 1.5*IQR | 
      heart$rest_blood_pressure > QS + 1.5*IQR)
## [1] 28
# 28 values out of the IQR for rest_blood_pressure

QI = quantile(heart$chol, 0.25)
QS = quantile(heart$chol, 0.75)
IQR = QS - QI
sum(heart$chol < QI - 1.5*IQR | 
      heart$chol > QS + 1.5*IQR)
## [1] 180
# 180 values out of the IQR for chol

QI = quantile(heart$max_heart_rate, 0.25)
QS = quantile(heart$max_heart_rate, 0.75)
IQR = QS - QI
sum(heart$max_heart_rate < QI - 1.5*IQR | 
      heart$max_heart_rate > QS + 1.5*IQR)
## [1] 2
# 2 values out of the IQR for max_heart_rate

QI = quantile(heart$oldpeak, 0.25)
QS = quantile(heart$oldpeak, 0.75)
IQR = QS - QI
sum(heart$oldpeak < QI - 1.5*IQR | 
      heart$oldpeak > QS + 1.5*IQR)
## [1] 16
# 16 values out of the IQR for oldpeak

# Let's see that graphically
g1 = ggplot(data = heart) + aes(x = rest_blood_pressure, y = 1) +
     geom_boxplot(fill = "lightblue", color = "blue", 
                  outlier.color = "red", outlier.shape = 16) +
     xlab("Rest Blood Pressure") + theme_minimal() + 
     geom_jitter(alpha = 0.7)

# With this first boxplot, we realised that the feature 
# rest_blood_pressure had strong outliers below 90 and above 180; with
# a low variability (without those values)
g1

g2 = ggplot(data = heart) + aes(x = chol, y  = 1) +
     geom_boxplot(fill = "lightblue", color = "blue", 
                  outlier.color = "red", outlier.shape = 16) +
     xlab("Cholesterol") + theme_minimal() + geom_jitter(alpha = 0.7)

# Thanks to the second boxplot, we found values that did not make sense
# chol as 0 or above 400; without these values it has a low variability 
g2

g3 = ggplot(data = heart) + aes(x = max_heart_rate, y = 1) +
     geom_boxplot(fill = "lightblue", color = "blue", 
                  outlier.color = "red", outlier.shape = 16) +
     xlab("Max Heart Rate") + theme_minimal() + geom_jitter(alpha = 0.7)

# Due to the geom_jitter() we saw that the values might follow a normal
# distribution, in which outliers did not have a huge impact as long as
# the sample size is sufficiently large. For the feature max_heart_rate
g3

g4 = ggplot(data = heart) + aes(x = oldpeak, y = 1) +
     geom_boxplot(fill = "lightblue", color = "blue", 
                  outlier.color = "red", outlier.shape = 16) +
     xlab("Old Peak") + theme_minimal() + geom_jitter(alpha = 0.7)

# In this fourth geom_jitter() we saw that the distribution of oldpeak
# was strongly skewed, as a considerable amount of values lied at 0.
# This gave us the reckoning that no matter the outliers, the
# the distribution of the variable is not going to be affected
g4

After seeing which values seem to get out of “normal” bounds, we must study those variables thoroughly.

  • rest_blood_pressure:

    • Abnormally low blood pressure < 90 mm Hg (life-threatening)

    • 90 mm Hg <= Normal Blood Pressure<= 120 mm Hg

    • 120 mm Hg < Elevated Blood Pressure <=129 mm Hg

    • 130 mm Hg <= Hypertension Stage 1 <= 139 mm Hg

    • 140 mm Hg <= Hypertension Stage 2 < 180 mm Hg

    • Hypertensive Crisis >= 180 mm Hg (life-threatening)

      We believe that the universities did not took the measurements of patients in a risk of death. If those measurements were to be true, they would be dead by now. So we reckon that those values must be discarded, as they lack of coherence.

  • chol:

    • Desirable Total Cholesterol < 200 mg/dl

    • 200 mg/dl <= Borderline High Total Cholesterol <= 239 mg/dl

    • 240 mg/dl <= High Total Cholesterol > 400 mg/dl

      From the plot we saw that there are a lot of conglomerated values at 0, which make no sense, as it is impossible that a person has 0 mg/dl of serum cholesterol. This makes us believe that those values as 0, represent some kind of NA or a measurement error. Therefore, handling those values is crucial. Another thing to mention are the extreme values of total cholesterol. Values above 400 mg/dl are a huge health risk and may be errors, however checking the stage on those values to see if they make sense will be a good idea.

      So in conclusion for this feature, the problem is the high amount of 0 values and seeing if values above 400 are coherent.

  • max_heart_rate: the maximum heart rate is quite related to the age. As it follows, when there are no exceptions the next ranges depending on age approximately.

    • 140 to 150 bpm -> Adults 70 and older

    • 150 to 160 bpm -> Adults 60 to 69

    • 160 to 170 bpm -> Adults 50 to 59

    • 170 to 180 bpm -> Adults 40 to 49

    • 180 to 190 bpm -> Adults 30 to 39

    • 190 to 200 bpm -> Adults 20 to 29

      There are a small amount of values below 120, we may have to see the distribution of this variable, which seems to be normal based on the small difference between the Median and the Mean. So maybe we should not eliminate outliers here as they are not affecting heavily the distribution of the data. Nevertheless a graphical study of the distribution will verify if this assumption is true or not.

  • oldpeak: has only 7 outliers from the plot and the summary we see that has a right skewed distribution, heavily produced due to the high amount of values at 0. So eliminating here outliers does not seem a good idea. However, a graphical check of the distribution will be a good way to check our assumption. Taking absolute values we have (note that we can take absolute values as it as measurement in millimeters which means that a negative value shows that the depression was caused in the other sense of the point of reference):

    • Normal or No ST Depression: oldpeak value of 0 (typically considered no ST depression).

    • Mild ST Depression: oldpeak values from approximately 0.5 to 1.0 millimeters.

    • Moderate ST Depression: oldpeak values from approximately 1.1 to 2.0 millimeters.

    • Severe ST Depression: oldpeak values greater than 2.0 millimeters.

summary(heart)    # to check again the values
##       age        sex     chest_pain rest_blood_pressure      chol      
##  Min.   :28.00   0:673   0:470      Min.   :  0.0       Min.   :  0.0  
##  1st Qu.:46.00   1:193   1: 42      1st Qu.:120.0       1st Qu.:174.2  
##  Median :54.00           2:168      Median :130.0       Median :223.0  
##  Mean   :53.13           3:186      Mean   :132.1       Mean   :199.0  
##  3rd Qu.:60.00                      3rd Qu.:140.0       3rd Qu.:268.0  
##  Max.   :77.00                      Max.   :200.0       Max.   :603.0  
##  blood_sugar     restecg max_heart_rate    exang            oldpeak       
##  Mode :logical   0:663   Min.   : 60.0   Mode :logical   Min.   :-2.6000  
##  FALSE:740       1:  0   1st Qu.:120.0   FALSE:529       1st Qu.: 0.0000  
##  TRUE :126       2:203   Median :140.0   TRUE :337       Median : 0.5000  
##                          Mean   :137.6                   Mean   : 0.8781  
##                          3rd Qu.:157.0                   3rd Qu.: 1.5000  
##                          Max.   :202.0                   Max.   : 6.2000  
##  slope   stage  
##  0: 74   0:392  
##  1:448   1:252  
##  2:344   2:102  
##          3: 94  
##          4: 26  
## 
# rest_blood_pressure OUTLIERS
# Outliers that make no sense
sum(heart$rest_blood_pressure < 90)     # only 2
## [1] 2
sum(heart$rest_blood_pressure >= 180)   # 20 values
## [1] 20
# Let's check those values
indeces1 = which((heart$rest_blood_pressure < 90) | 
                  (heart$rest_blood_pressure >= 180))
print(heart[indeces1, ])
##     age sex chest_pain rest_blood_pressure chol blood_sugar restecg
## 84   68   0          3                 180  274        TRUE       2
## 127  56   1          0                 200  288        TRUE       2
## 189  54   0          2                 192  283       FALSE       2
## 202  64   1          0                 180  325       FALSE       0
## 232  55   1          0                 180  327       FALSE       0
## 339  39   0          2                 190  241       FALSE       0
## 376  45   1          2                 180  166       FALSE       0
## 388  46   0          0                 180  280       FALSE       0
## 476  57   1          0                 180  347       FALSE       0
## 485  59   0          3                 180  213       FALSE       0
## 549  54   0          0                 200  198       FALSE       0
## 570  53   0          0                 180  285       FALSE       0
## 596  58   1          2                 180  393       FALSE       0
## 645  53   0          0                  80    0       FALSE       0
## 648  54   0          0                 180    0       FALSE       0
## 681  61   0          3                 200    0        TRUE       0
## 701  63   0          0                 185    0       FALSE       0
## 702  64   1          0                 200    0       FALSE       0
## 727  60   0          3                 180    0       FALSE       2
## 748  55   0          3                   0    0       FALSE       0
## 841  57   0          2                 180  285        TRUE       0
## 847  61   0          0                 190  287        TRUE       2
##     max_heart_rate exang oldpeak slope stage
## 84             150  TRUE     1.6     1     3
## 127            133  TRUE     4.0     0     3
## 189            195 FALSE     0.0     2     1
## 202            154  TRUE     0.0     2     0
## 232            117  TRUE     3.4     1     2
## 339            106 FALSE     0.0     1     0
## 376            180 FALSE     0.0     2     0
## 388            120 FALSE     0.0     1     0
## 476            126  TRUE     0.8     1     0
## 485            100 FALSE     0.0     2     0
## 549            142  TRUE     2.0     1     1
## 570            120  TRUE     1.5     1     1
## 596            110  TRUE     1.0     1     1
## 645            141  TRUE     2.0     0     0
## 648            150 FALSE     1.5     1     1
## 681             70 FALSE     0.0     2     3
## 701             98  TRUE     0.0     2     1
## 702            140  TRUE     1.0     1     3
## 727            140  TRUE     1.5     1     0
## 748            155 FALSE     1.5     1     3
## 841            120 FALSE     0.8     1     1
## 847            150  TRUE     2.0     0     4
# As below 90 there are just 2 observations, we will eliminate those
heart = heart[heart$rest_blood_pressure >= 90, ]
# Now our indeces will be
indeces1 = which(heart$rest_blood_pressure >= 180)
print(heart[indeces1, ])
##     age sex chest_pain rest_blood_pressure chol blood_sugar restecg
## 84   68   0          3                 180  274        TRUE       2
## 127  56   1          0                 200  288        TRUE       2
## 189  54   0          2                 192  283       FALSE       2
## 202  64   1          0                 180  325       FALSE       0
## 232  55   1          0                 180  327       FALSE       0
## 339  39   0          2                 190  241       FALSE       0
## 376  45   1          2                 180  166       FALSE       0
## 388  46   0          0                 180  280       FALSE       0
## 476  57   1          0                 180  347       FALSE       0
## 485  59   0          3                 180  213       FALSE       0
## 549  54   0          0                 200  198       FALSE       0
## 570  53   0          0                 180  285       FALSE       0
## 596  58   1          2                 180  393       FALSE       0
## 648  54   0          0                 180    0       FALSE       0
## 681  61   0          3                 200    0        TRUE       0
## 701  63   0          0                 185    0       FALSE       0
## 702  64   1          0                 200    0       FALSE       0
## 727  60   0          3                 180    0       FALSE       2
## 841  57   0          2                 180  285        TRUE       0
## 847  61   0          0                 190  287        TRUE       2
##     max_heart_rate exang oldpeak slope stage
## 84             150  TRUE     1.6     1     3
## 127            133  TRUE     4.0     0     3
## 189            195 FALSE     0.0     2     1
## 202            154  TRUE     0.0     2     0
## 232            117  TRUE     3.4     1     2
## 339            106 FALSE     0.0     1     0
## 376            180 FALSE     0.0     2     0
## 388            120 FALSE     0.0     1     0
## 476            126  TRUE     0.8     1     0
## 485            100 FALSE     0.0     2     0
## 549            142  TRUE     2.0     1     1
## 570            120  TRUE     1.5     1     1
## 596            110  TRUE     1.0     1     1
## 648            150 FALSE     1.5     1     1
## 681             70 FALSE     0.0     2     3
## 701             98  TRUE     0.0     2     1
## 702            140  TRUE     1.0     1     3
## 727            140  TRUE     1.5     1     0
## 841            120 FALSE     0.8     1     1
## 847            150  TRUE     2.0     0     4
# For the rest 20 values that have a rest_blood_pressure above 180
# we will study them later

# chol OUTLIERS
sum(heart$chol == 0)      # huge amount (165 values)
## [1] 166
sum(heart$chol > 400)     # very few (13 values)
## [1] 14
# Let's check those values
indeces2_1 = which((heart$chol == 0))
indeces2_2 = which((heart$chol > 400))
print(heart[indeces2_1, ])
##     age sex chest_pain rest_blood_pressure chol blood_sugar restecg
## 335  39   0          2                 120    0       FALSE       2
## 401  48   0          2                 100    0       FALSE       0
## 530  38   0          0                 110    0       FALSE       0
## 543  52   0          0                 170    0       FALSE       0
## 548  54   0          0                 140    0       FALSE       0
## 582  66   0          0                 140    0       FALSE       0
## 598  32   0          1                  95    0       FALSE       0
## 599  34   0          0                 115    0       FALSE       2
## 600  35   0          0                 112    0       FALSE       0
## 601  36   0          0                 110    0        TRUE       0
## 602  38   1          0                 105    0       FALSE       0
## 603  38   1          0                 110    0       FALSE       0
## 604  38   0          3                 100    0       FALSE       0
## 605  38   0          3                 115    0       FALSE       0
## 606  38   0          0                 135    0       FALSE       0
## 607  38   0          0                 150    0       FALSE       0
## 608  40   0          0                  95    0       FALSE       0
## 609  41   0          0                 125    0       FALSE       0
## 610  42   0          0                 105    0       FALSE       0
## 611  42   0          0                 145    0       FALSE       0
## 612  43   0          0                 100    0       FALSE       0
## 613  43   0          0                 115    0       FALSE       0
## 614  43   0          0                 140    0       FALSE       0
## 615  45   0          3                 110    0       FALSE       0
## 616  46   0          0                 100    0       FALSE       0
## 617  46   0          0                 115    0       FALSE       0
## 618  47   0          3                 110    0       FALSE       0
## 619  47   0          3                 155    0       FALSE       0
## 620  47   0          0                 110    0       FALSE       0
## 621  47   0          0                 160    0       FALSE       0
## 622  48   0          0                 115    0       FALSE       0
## 623  50   1          0                 160    0       FALSE       0
## 624  50   0          0                 115    0       FALSE       0
## 625  50   0          0                 120    0       FALSE       0
## 626  50   0          0                 145    0       FALSE       0
## 627  51   1          0                 120    0        TRUE       0
## 628  51   0          0                 110    0       FALSE       0
## 629  51   0          0                 120    0        TRUE       0
## 630  51   0          0                 130    0       FALSE       0
## 631  51   0          0                 130    0       FALSE       0
## 632  51   0          0                 140    0       FALSE       0
## 633  51   0          0                  95    0       FALSE       0
## 634  52   0          0                 130    0       FALSE       0
## 635  52   0          0                 135    0       FALSE       0
## 636  52   0          0                 165    0       FALSE       0
## 637  52   0          0                  95    0       FALSE       0
## 638  53   0          2                 120    0       FALSE       0
## 639  53   0          2                 130    0       FALSE       0
## 640  53   0          3                 105    0       FALSE       0
## 641  53   0          3                 160    0       FALSE       2
## 642  53   0          0                 120    0       FALSE       0
## 643  53   0          0                 125    0       FALSE       0
## 644  53   0          0                 130    0       FALSE       2
## 646  54   0          0                 120    0       FALSE       0
## 647  54   0          0                 130    0       FALSE       0
## 648  54   0          0                 180    0       FALSE       0
## 649  55   0          2                 140    0       FALSE       0
## 650  55   0          0                 115    0       FALSE       0
## 651  55   0          0                 120    0       FALSE       0
## 652  55   0          0                 140    0       FALSE       0
## 653  56   0          3                 120    0       FALSE       0
## 654  56   0          3                 125    0       FALSE       0
## 655  56   0          3                 155    0       FALSE       0
## 656  56   0          0                 115    0       FALSE       0
## 657  56   0          0                 120    0       FALSE       2
## 658  56   0          0                 120    0       FALSE       0
## 659  56   0          0                 125    0        TRUE       0
## 660  56   0          0                 140    0       FALSE       0
## 661  57   0          3                 105    0       FALSE       0
## 662  57   0          0                 110    0       FALSE       0
## 663  57   0          0                 140    0       FALSE       0
## 664  57   0          0                 140    0       FALSE       0
## 665  57   0          0                 160    0        TRUE       0
## 666  57   0          0                  95    0       FALSE       0
## 667  58   0          0                 115    0       FALSE       0
## 668  58   0          0                 130    0       FALSE       0
## 669  58   0          0                 170    0       FALSE       0
## 670  59   0          3                 125    0       FALSE       0
## 671  59   0          0                 110    0        TRUE       0
## 672  59   0          0                 120    0       FALSE       0
## 673  59   0          0                 125    0       FALSE       0
## 674  59   0          0                 135    0       FALSE       0
## 675  60   0          3                 115    0       FALSE       0
## 676  60   0          0                 125    0       FALSE       0
## 677  60   0          0                 130    0       FALSE       0
## 678  60   0          0                 135    0       FALSE       0
## 679  60   0          0                 160    0       FALSE       0
## 680  60   0          0                 160    0        TRUE       0
## 681  61   0          3                 200    0        TRUE       0
## 682  61   0          0                 105    0       FALSE       0
## 683  61   0          0                 110    0       FALSE       0
## 684  61   0          0                 125    0       FALSE       0
## 685  61   0          0                 130    0       FALSE       2
## 686  61   0          0                 130    0        TRUE       0
## 687  61   0          0                 150    0       FALSE       0
## 688  61   0          0                 150    0       FALSE       0
## 689  61   0          0                 160    0        TRUE       0
## 690  62   1          1                 140    0       FALSE       0
## 691  62   1          0                 120    0       FALSE       0
## 692  62   0          1                 120    0       FALSE       2
## 693  62   0          3                 160    0       FALSE       0
## 694  62   0          0                 115    0       FALSE       0
## 695  62   0          0                 115    0       FALSE       0
## 696  62   0          0                 150    0        TRUE       0
## 697  63   0          0                 100    0       FALSE       0
## 698  63   0          0                 140    0       FALSE       2
## 699  63   0          0                 150    0       FALSE       0
## 700  63   0          0                 150    0       FALSE       0
## 701  63   0          0                 185    0       FALSE       0
## 702  64   1          0                 200    0       FALSE       0
## 703  64   1          0                  95    0       FALSE       0
## 704  64   0          0                 110    0       FALSE       0
## 705  65   0          0                 115    0       FALSE       0
## 706  65   0          0                 145    0       FALSE       0
## 707  65   0          0                 155    0       FALSE       0
## 708  65   0          0                 160    0        TRUE       0
## 709  66   1          0                 155    0       FALSE       0
## 710  66   0          0                 150    0       FALSE       0
## 711  67   0          1                 145    0       FALSE       2
## 712  68   0          0                 135    0       FALSE       0
## 713  68   0          0                 145    0       FALSE       0
## 714  69   0          0                 135    0       FALSE       0
## 715  70   0          0                 115    0       FALSE       0
## 716  70   0          0                 140    0        TRUE       0
## 717  72   0          3                 160    0       FALSE       2
## 718  73   1          3                 160    0       FALSE       0
## 719  74   0          2                 145    0       FALSE       0
## 725  66   0          3                 120    0       FALSE       0
## 727  60   0          3                 180    0       FALSE       2
## 728  60   0          3                 120    0        TRUE       0
## 731  59   0          0                 140    0       FALSE       0
## 732  62   0          0                 110    0       FALSE       0
## 733  57   0          0                 128    0        TRUE       0
## 737  63   0          0                 126    0       FALSE       2
## 738  60   0          0                 152    0       FALSE       0
## 739  58   0          0                 116    0       FALSE       0
## 740  64   0          0                 120    0        TRUE       0
## 741  63   0          3                 130    0       FALSE       0
## 742  52   0          3                 128    0       FALSE       2
## 743  69   0          0                 130    0        TRUE       0
## 749  52   0          3                 122    0       FALSE       0
## 750  64   0          0                 144    0       FALSE       0
## 751  60   0          0                 120    0       FALSE       0
## 752  59   0          0                 154    0       FALSE       0
## 753  61   0          3                 120    0       FALSE       0
## 754  40   0          0                 125    0        TRUE       0
## 755  61   0          0                 125    0        TRUE       0
## 756  41   0          0                 104    0       FALSE       0
## 757  63   0          0                 136    0       FALSE       0
## 759  51   0          0                 128    0       FALSE       0
## 760  59   0          3                 125    0       FALSE       0
## 762  55   0          3                 120    0       FALSE       0
## 765  53   0          0                 126    0       FALSE       0
## 766  68   0          0                 138    0       FALSE       0
## 767  53   0          0                 154    0       FALSE       0
## 768  59   0          0                 178    0        TRUE       2
## 769  61   0          0                 110    0       FALSE       0
## 771  56   0          3                 170    0       FALSE       2
## 772  58   0          2                 126    0        TRUE       0
## 773  69   0          3                 140    0       FALSE       0
## 775  58   0          0                 120    0       FALSE       2
## 781  49   0          1                 130    0       FALSE       0
## 794  67   0          0                 120    0        TRUE       0
## 798  43   0          0                 122    0       FALSE       0
## 799  63   0          3                 130    0        TRUE       0
## 802  48   0          3                 102    0       FALSE       0
##     max_heart_rate exang oldpeak slope stage
## 335            146 FALSE     2.0     2     0
## 401            100 FALSE     0.0     1     0
## 530            150  TRUE     1.0     1     1
## 543            126  TRUE     1.5     1     1
## 548            118  TRUE     0.0     1     1
## 582             94  TRUE     1.0     1     1
## 598            127 FALSE     0.7     2     1
## 599            154 FALSE     0.2     2     1
## 600            130  TRUE     1.4     0     3
## 601            125  TRUE     1.0     1     1
## 602            166 FALSE     2.8     2     2
## 603            156 FALSE     0.0     1     1
## 604            179 FALSE    -1.1     2     0
## 605            128  TRUE     0.0     1     1
## 606            150 FALSE     0.0     2     2
## 607            120  TRUE     2.0     1     1
## 608            144 FALSE     0.0     2     2
## 609            176 FALSE     1.6     2     2
## 610            128  TRUE    -1.5     0     1
## 611             99  TRUE     0.0     1     2
## 612            122 FALSE     1.5     0     3
## 613            145  TRUE     2.0     1     4
## 614            140  TRUE     0.5     2     2
## 615            138 FALSE    -0.1     2     0
## 616            133 FALSE    -2.6     1     1
## 617            113  TRUE     1.5     1     1
## 618            120  TRUE     0.0     1     1
## 619            118  TRUE     1.0     1     3
## 620            149 FALSE     2.1     2     2
## 621            124  TRUE     0.0     1     1
## 622            128 FALSE     0.0     1     2
## 623            110 FALSE     0.0     0     1
## 624            120  TRUE     0.5     1     3
## 625            156  TRUE     0.0     2     3
## 626            139  TRUE     0.7     1     1
## 627            127  TRUE     1.5     2     2
## 628             92 FALSE     0.0     1     4
## 629            104 FALSE     0.0     1     3
## 630            170 FALSE    -0.7     2     2
## 631            163 FALSE     0.0     2     1
## 632             60 FALSE     0.0     1     2
## 633            126 FALSE     2.2     1     2
## 634            120 FALSE     0.0     1     2
## 635            128  TRUE     2.0     1     2
## 636            122  TRUE     1.0     2     2
## 637             82  TRUE     1.5     1     2
## 638             95 FALSE     0.0     1     3
## 639            120 FALSE     0.7     0     0
## 640            115 FALSE     0.0     1     1
## 641            122  TRUE     0.0     2     1
## 642            120 FALSE     0.0     1     1
## 643            120 FALSE     1.5     2     4
## 644            135  TRUE     1.0     1     2
## 646            155 FALSE     0.0     1     2
## 647            110  TRUE     3.0     1     3
## 648            150 FALSE     1.5     1     1
## 649            150 FALSE     0.2     2     0
## 650            155 FALSE     0.1     1     1
## 651             92 FALSE     0.3     2     4
## 652             83 FALSE     0.0     1     2
## 653             97 FALSE     0.0     1     0
## 654             98 FALSE    -2.0     1     2
## 655             99 FALSE     0.0     1     2
## 656             82 FALSE    -1.0     2     1
## 657            100  TRUE    -1.0     0     2
## 658            148 FALSE     0.0     1     2
## 659            103  TRUE     1.0     1     3
## 660            121  TRUE     1.8     2     1
## 661            148 FALSE     0.3     1     1
## 662            131  TRUE     1.4     2     3
## 663            120  TRUE     2.0     1     2
## 664            100  TRUE     0.0     1     3
## 665             98  TRUE     2.0     1     2
## 666            182 FALSE     0.7     0     1
## 667            138 FALSE     0.5     2     1
## 668            100  TRUE     1.0     1     4
## 669            105  TRUE     0.0     1     1
## 670            175 FALSE     2.6     1     1
## 671             94 FALSE     0.0     1     3
## 672            115 FALSE     0.0     1     2
## 673            119  TRUE     0.9     2     1
## 674            115  TRUE     1.0     1     1
## 675            143 FALSE     2.4     2     1
## 676            110 FALSE     0.1     2     3
## 677            130  TRUE     1.1     0     1
## 678             63  TRUE     0.5     2     3
## 679             99  TRUE     0.5     1     3
## 680            149 FALSE     0.4     1     1
## 681             70 FALSE     0.0     2     3
## 682            110  TRUE     1.5     2     1
## 683            113 FALSE     1.4     1     1
## 684            105  TRUE     0.0     0     3
## 685            115 FALSE     0.0     1     3
## 686             77 FALSE     2.5     1     3
## 687            105  TRUE     0.0     1     1
## 688            117  TRUE     2.0     1     2
## 689            145 FALSE     1.0     1     2
## 690            143 FALSE     0.0     1     2
## 691            123  TRUE     1.7     0     1
## 692            134 FALSE    -0.8     1     1
## 693             72  TRUE     0.0     1     3
## 694            128  TRUE     2.5     0     2
## 695             72  TRUE    -0.5     1     1
## 696             78 FALSE     2.0     1     3
## 697            109 FALSE    -0.9     1     1
## 698            149 FALSE     2.0     2     2
## 699             86  TRUE     2.0     1     3
## 700            154 FALSE     3.7     2     3
## 701             98  TRUE     0.0     2     1
## 702            140  TRUE     1.0     1     3
## 703            145 FALSE     1.1     0     1
## 704            114  TRUE     1.3     0     1
## 705             93  TRUE     0.0     1     1
## 706             67 FALSE     0.0     1     3
## 707            154 FALSE     1.0     2     0
## 708            122 FALSE     1.5     0     3
## 709             90 FALSE     0.0     1     1
## 710            108  TRUE     2.0     1     3
## 711            125 FALSE     0.0     1     2
## 712            120  TRUE     0.0     2     3
## 713            136 FALSE     1.8     2     1
## 714            130 FALSE     0.0     1     1
## 715             92  TRUE     0.0     1     1
## 716            157  TRUE     2.0     1     3
## 717            114 FALSE     1.6     1     0
## 718            121 FALSE     0.0     2     1
## 719            123 FALSE     1.3     2     1
## 725            120 FALSE    -0.5     2     0
## 727            140  TRUE     1.5     1     0
## 728            141  TRUE     2.0     2     3
## 731            117  TRUE     1.0     1     1
## 732            120  TRUE     0.5     1     1
## 733            148  TRUE     1.0     1     1
## 737            120 FALSE     1.5     0     0
## 738            118  TRUE     0.0     1     0
## 739            124 FALSE     1.0     2     2
## 740            106 FALSE     2.0     1     1
## 741            111  TRUE     0.0     1     3
## 742            180 FALSE     3.0     2     2
## 743            129 FALSE     1.0     1     2
## 749            110  TRUE     2.0     0     2
## 750            122  TRUE     1.0     1     3
## 751            133  TRUE     2.0     2     0
## 752            131  TRUE     1.5     0     0
## 753             80  TRUE     0.0     1     3
## 754            165 FALSE     0.0     1     1
## 755             86 FALSE     1.5     1     3
## 756            111 FALSE     0.0     1     0
## 757             84  TRUE     0.0     2     2
## 759            107 FALSE     0.0     1     0
## 760            128  TRUE     2.0     0     2
## 762            125  TRUE     2.5     1     1
## 765            106 FALSE     0.0     2     1
## 766            130  TRUE     3.0     1     2
## 767            140  TRUE     1.5     1     2
## 768            120  TRUE     0.0     1     1
## 769            108  TRUE     2.0     0     2
## 771            123  TRUE     2.5     1     4
## 772            110  TRUE     2.0     1     2
## 773            118 FALSE     2.5     0     2
## 775            106  TRUE     1.5     0     1
## 781            145 FALSE     3.0     1     2
## 794            150 FALSE     1.5     0     3
## 798            120 FALSE     0.5     2     1
## 799            160 FALSE     3.0     1     0
## 802            110  TRUE     1.0     0     1
print(heart[indeces2_2, ])
##     age sex chest_pain rest_blood_pressure chol blood_sugar restecg
## 49   65   1          3                 140  417        TRUE       2
## 122  63   1          0                 150  407       FALSE       2
## 153  67   1          3                 115  564       FALSE       2
## 182  56   1          0                 134  409       FALSE       2
## 348  40   0          3                 140  412       FALSE       0
## 374  44   0          0                 150  412       FALSE       0
## 435  53   1          2                 113  468       FALSE       0
## 501  40   0          0                 120  466       FALSE       0
## 529  32   0          0                 118  529       FALSE       0
## 547  54   0          0                 130  603        TRUE       0
## 567  52   0          0                 140  404       FALSE       0
## 569  53   0          3                 145  518       FALSE       0
## 585  44   0          0                 135  491       FALSE       0
## 784  58   0          0                 132  458        TRUE       0
##     max_heart_rate exang oldpeak slope stage
## 49             157 FALSE     0.8     2     0
## 122            154 FALSE     4.0     1     4
## 153            160 FALSE     1.6     1     0
## 182            150  TRUE     1.9     1     2
## 348            188 FALSE     0.0     2     0
## 374            170 FALSE     0.0     2     0
## 435            127 FALSE     0.0     1     0
## 501            152  TRUE     1.0     1     1
## 529            130 FALSE     0.0     2     1
## 547            125  TRUE     1.0     1     1
## 567            124  TRUE     2.0     1     1
## 569            130 FALSE     0.0     1     1
## 585            135 FALSE     0.0     1     1
## 784             69 FALSE     1.0     0     0
# chol over 400 seems to be in just a few cases in which the rest of
# the results are coherent and cohesive. Therefore, instead of deleting
# these obeservations a change of the chol by the mean of the chol
# that have their corresponding age group will be the approach
# groups of age -> NEW FEATURE
# 70 and older; 60 to 69; 50 to 59; 40 to 49; 30 to 39; 20 to 29
heart$ageGroups = numeric(nrow(heart))
for (i in 1:nrow(heart)){
  if (20 <= heart$age[i] && heart$age[i] <= 29){
    heart$ageGroups[i] = 0
  }else if (30 <= heart$age[i] && heart$age[i] <= 39){
    heart$ageGroups[i] = 1
  }else if (40 <= heart$age[i] && heart$age[i] <= 49){
    heart$ageGroups[i] = 2
  }else if (50 <= heart$age[i] && heart$age[i] <= 59){
    heart$ageGroups[i] = 3
  }else if (60 <= heart$age[i] && heart$age[i] <= 69){
    heart$ageGroups[i] = 4
  }else{
    heart$ageGroups[i] = 5
  }
}

# Then we factorise this new variable
heart$ageGroups = as.factor(heart$ageGroups)
min(heart$age[which((heart$chol > 400))]) # minimum age
## [1] 32
max(heart$age[which((heart$chol > 400))]) # maximum age
## [1] 67
# So we need to compute the mean of the group of ages 1, 2, 3 and 4
# when chol is != 0 and <= 400
mean1 = mean(heart$chol[heart$chol <= 400 & heart$chol != 0 &
                        heart$ageGroups == 1])
mean1
## [1] 230.0484
mean1 = ceiling(mean1)  # chol is an integer
mean1
## [1] 231
mean2 = mean(heart$chol[heart$chol <= 400 & heart$chol != 0 & 
                        heart$ageGroups == 2])
mean2
## [1] 238.75
mean2 = ceiling(mean2)  # chol is an integer
mean2
## [1] 239
mean3 = mean(heart$chol[heart$chol <= 400 & heart$chol != 0 & 
                        heart$ageGroups == 3])
mean3
## [1] 244.5591
mean3 = ceiling(mean3)  # chol is an integer
mean3
## [1] 245
mean4 = mean(heart$chol[heart$chol <= 400 & heart$chol != 0 & 
                        heart$ageGroups == 4])
mean4
## [1] 249.3985
mean4 = ceiling(mean4)  # chol is an integer
mean4
## [1] 250
# Then we change the value where chol > 400 by its corresponding mean
for (i in 1:length(indeces2_2)){
  index_to_use = indeces2_2[i]
  if (heart$ageGroups[index_to_use] == 1){
    heart$chol[index_to_use] = mean1
  }else if (heart$ageGroups[index_to_use] == 2){
    heart$chol[index_to_use] = mean2
  }else if (heart$ageGroups[index_to_use] == 3){
    heart$chol[index_to_use] = mean3
  }else{
    heart$chol[index_to_use] = mean4
  }
}

# Now the problem is in the observations that have chol as 0
# For chol = 0 we will be studying other variables and seek for intersections and relations later
print(heart[indeces2_1, ])
##     age sex chest_pain rest_blood_pressure chol blood_sugar restecg
## 335  39   0          2                 120    0       FALSE       2
## 401  48   0          2                 100    0       FALSE       0
## 530  38   0          0                 110    0       FALSE       0
## 543  52   0          0                 170    0       FALSE       0
## 548  54   0          0                 140    0       FALSE       0
## 582  66   0          0                 140    0       FALSE       0
## 598  32   0          1                  95    0       FALSE       0
## 599  34   0          0                 115    0       FALSE       2
## 600  35   0          0                 112    0       FALSE       0
## 601  36   0          0                 110    0        TRUE       0
## 602  38   1          0                 105    0       FALSE       0
## 603  38   1          0                 110    0       FALSE       0
## 604  38   0          3                 100    0       FALSE       0
## 605  38   0          3                 115    0       FALSE       0
## 606  38   0          0                 135    0       FALSE       0
## 607  38   0          0                 150    0       FALSE       0
## 608  40   0          0                  95    0       FALSE       0
## 609  41   0          0                 125    0       FALSE       0
## 610  42   0          0                 105    0       FALSE       0
## 611  42   0          0                 145    0       FALSE       0
## 612  43   0          0                 100    0       FALSE       0
## 613  43   0          0                 115    0       FALSE       0
## 614  43   0          0                 140    0       FALSE       0
## 615  45   0          3                 110    0       FALSE       0
## 616  46   0          0                 100    0       FALSE       0
## 617  46   0          0                 115    0       FALSE       0
## 618  47   0          3                 110    0       FALSE       0
## 619  47   0          3                 155    0       FALSE       0
## 620  47   0          0                 110    0       FALSE       0
## 621  47   0          0                 160    0       FALSE       0
## 622  48   0          0                 115    0       FALSE       0
## 623  50   1          0                 160    0       FALSE       0
## 624  50   0          0                 115    0       FALSE       0
## 625  50   0          0                 120    0       FALSE       0
## 626  50   0          0                 145    0       FALSE       0
## 627  51   1          0                 120    0        TRUE       0
## 628  51   0          0                 110    0       FALSE       0
## 629  51   0          0                 120    0        TRUE       0
## 630  51   0          0                 130    0       FALSE       0
## 631  51   0          0                 130    0       FALSE       0
## 632  51   0          0                 140    0       FALSE       0
## 633  51   0          0                  95    0       FALSE       0
## 634  52   0          0                 130    0       FALSE       0
## 635  52   0          0                 135    0       FALSE       0
## 636  52   0          0                 165    0       FALSE       0
## 637  52   0          0                  95    0       FALSE       0
## 638  53   0          2                 120    0       FALSE       0
## 639  53   0          2                 130    0       FALSE       0
## 640  53   0          3                 105    0       FALSE       0
## 641  53   0          3                 160    0       FALSE       2
## 642  53   0          0                 120    0       FALSE       0
## 643  53   0          0                 125    0       FALSE       0
## 644  53   0          0                 130    0       FALSE       2
## 646  54   0          0                 120    0       FALSE       0
## 647  54   0          0                 130    0       FALSE       0
## 648  54   0          0                 180    0       FALSE       0
## 649  55   0          2                 140    0       FALSE       0
## 650  55   0          0                 115    0       FALSE       0
## 651  55   0          0                 120    0       FALSE       0
## 652  55   0          0                 140    0       FALSE       0
## 653  56   0          3                 120    0       FALSE       0
## 654  56   0          3                 125    0       FALSE       0
## 655  56   0          3                 155    0       FALSE       0
## 656  56   0          0                 115    0       FALSE       0
## 657  56   0          0                 120    0       FALSE       2
## 658  56   0          0                 120    0       FALSE       0
## 659  56   0          0                 125    0        TRUE       0
## 660  56   0          0                 140    0       FALSE       0
## 661  57   0          3                 105    0       FALSE       0
## 662  57   0          0                 110    0       FALSE       0
## 663  57   0          0                 140    0       FALSE       0
## 664  57   0          0                 140    0       FALSE       0
## 665  57   0          0                 160    0        TRUE       0
## 666  57   0          0                  95    0       FALSE       0
## 667  58   0          0                 115    0       FALSE       0
## 668  58   0          0                 130    0       FALSE       0
## 669  58   0          0                 170    0       FALSE       0
## 670  59   0          3                 125    0       FALSE       0
## 671  59   0          0                 110    0        TRUE       0
## 672  59   0          0                 120    0       FALSE       0
## 673  59   0          0                 125    0       FALSE       0
## 674  59   0          0                 135    0       FALSE       0
## 675  60   0          3                 115    0       FALSE       0
## 676  60   0          0                 125    0       FALSE       0
## 677  60   0          0                 130    0       FALSE       0
## 678  60   0          0                 135    0       FALSE       0
## 679  60   0          0                 160    0       FALSE       0
## 680  60   0          0                 160    0        TRUE       0
## 681  61   0          3                 200    0        TRUE       0
## 682  61   0          0                 105    0       FALSE       0
## 683  61   0          0                 110    0       FALSE       0
## 684  61   0          0                 125    0       FALSE       0
## 685  61   0          0                 130    0       FALSE       2
## 686  61   0          0                 130    0        TRUE       0
## 687  61   0          0                 150    0       FALSE       0
## 688  61   0          0                 150    0       FALSE       0
## 689  61   0          0                 160    0        TRUE       0
## 690  62   1          1                 140    0       FALSE       0
## 691  62   1          0                 120    0       FALSE       0
## 692  62   0          1                 120    0       FALSE       2
## 693  62   0          3                 160    0       FALSE       0
## 694  62   0          0                 115    0       FALSE       0
## 695  62   0          0                 115    0       FALSE       0
## 696  62   0          0                 150    0        TRUE       0
## 697  63   0          0                 100    0       FALSE       0
## 698  63   0          0                 140    0       FALSE       2
## 699  63   0          0                 150    0       FALSE       0
## 700  63   0          0                 150    0       FALSE       0
## 701  63   0          0                 185    0       FALSE       0
## 702  64   1          0                 200    0       FALSE       0
## 703  64   1          0                  95    0       FALSE       0
## 704  64   0          0                 110    0       FALSE       0
## 705  65   0          0                 115    0       FALSE       0
## 706  65   0          0                 145    0       FALSE       0
## 707  65   0          0                 155    0       FALSE       0
## 708  65   0          0                 160    0        TRUE       0
## 709  66   1          0                 155    0       FALSE       0
## 710  66   0          0                 150    0       FALSE       0
## 711  67   0          1                 145    0       FALSE       2
## 712  68   0          0                 135    0       FALSE       0
## 713  68   0          0                 145    0       FALSE       0
## 714  69   0          0                 135    0       FALSE       0
## 715  70   0          0                 115    0       FALSE       0
## 716  70   0          0                 140    0        TRUE       0
## 717  72   0          3                 160    0       FALSE       2
## 718  73   1          3                 160    0       FALSE       0
## 719  74   0          2                 145    0       FALSE       0
## 725  66   0          3                 120    0       FALSE       0
## 727  60   0          3                 180    0       FALSE       2
## 728  60   0          3                 120    0        TRUE       0
## 731  59   0          0                 140    0       FALSE       0
## 732  62   0          0                 110    0       FALSE       0
## 733  57   0          0                 128    0        TRUE       0
## 737  63   0          0                 126    0       FALSE       2
## 738  60   0          0                 152    0       FALSE       0
## 739  58   0          0                 116    0       FALSE       0
## 740  64   0          0                 120    0        TRUE       0
## 741  63   0          3                 130    0       FALSE       0
## 742  52   0          3                 128    0       FALSE       2
## 743  69   0          0                 130    0        TRUE       0
## 749  52   0          3                 122    0       FALSE       0
## 750  64   0          0                 144    0       FALSE       0
## 751  60   0          0                 120    0       FALSE       0
## 752  59   0          0                 154    0       FALSE       0
## 753  61   0          3                 120    0       FALSE       0
## 754  40   0          0                 125    0        TRUE       0
## 755  61   0          0                 125    0        TRUE       0
## 756  41   0          0                 104    0       FALSE       0
## 757  63   0          0                 136    0       FALSE       0
## 759  51   0          0                 128    0       FALSE       0
## 760  59   0          3                 125    0       FALSE       0
## 762  55   0          3                 120    0       FALSE       0
## 765  53   0          0                 126    0       FALSE       0
## 766  68   0          0                 138    0       FALSE       0
## 767  53   0          0                 154    0       FALSE       0
## 768  59   0          0                 178    0        TRUE       2
## 769  61   0          0                 110    0       FALSE       0
## 771  56   0          3                 170    0       FALSE       2
## 772  58   0          2                 126    0        TRUE       0
## 773  69   0          3                 140    0       FALSE       0
## 775  58   0          0                 120    0       FALSE       2
## 781  49   0          1                 130    0       FALSE       0
## 794  67   0          0                 120    0        TRUE       0
## 798  43   0          0                 122    0       FALSE       0
## 799  63   0          3                 130    0        TRUE       0
## 802  48   0          3                 102    0       FALSE       0
##     max_heart_rate exang oldpeak slope stage ageGroups
## 335            146 FALSE     2.0     2     0         1
## 401            100 FALSE     0.0     1     0         2
## 530            150  TRUE     1.0     1     1         1
## 543            126  TRUE     1.5     1     1         3
## 548            118  TRUE     0.0     1     1         3
## 582             94  TRUE     1.0     1     1         4
## 598            127 FALSE     0.7     2     1         1
## 599            154 FALSE     0.2     2     1         1
## 600            130  TRUE     1.4     0     3         1
## 601            125  TRUE     1.0     1     1         1
## 602            166 FALSE     2.8     2     2         1
## 603            156 FALSE     0.0     1     1         1
## 604            179 FALSE    -1.1     2     0         1
## 605            128  TRUE     0.0     1     1         1
## 606            150 FALSE     0.0     2     2         1
## 607            120  TRUE     2.0     1     1         1
## 608            144 FALSE     0.0     2     2         2
## 609            176 FALSE     1.6     2     2         2
## 610            128  TRUE    -1.5     0     1         2
## 611             99  TRUE     0.0     1     2         2
## 612            122 FALSE     1.5     0     3         2
## 613            145  TRUE     2.0     1     4         2
## 614            140  TRUE     0.5     2     2         2
## 615            138 FALSE    -0.1     2     0         2
## 616            133 FALSE    -2.6     1     1         2
## 617            113  TRUE     1.5     1     1         2
## 618            120  TRUE     0.0     1     1         2
## 619            118  TRUE     1.0     1     3         2
## 620            149 FALSE     2.1     2     2         2
## 621            124  TRUE     0.0     1     1         2
## 622            128 FALSE     0.0     1     2         2
## 623            110 FALSE     0.0     0     1         3
## 624            120  TRUE     0.5     1     3         3
## 625            156  TRUE     0.0     2     3         3
## 626            139  TRUE     0.7     1     1         3
## 627            127  TRUE     1.5     2     2         3
## 628             92 FALSE     0.0     1     4         3
## 629            104 FALSE     0.0     1     3         3
## 630            170 FALSE    -0.7     2     2         3
## 631            163 FALSE     0.0     2     1         3
## 632             60 FALSE     0.0     1     2         3
## 633            126 FALSE     2.2     1     2         3
## 634            120 FALSE     0.0     1     2         3
## 635            128  TRUE     2.0     1     2         3
## 636            122  TRUE     1.0     2     2         3
## 637             82  TRUE     1.5     1     2         3
## 638             95 FALSE     0.0     1     3         3
## 639            120 FALSE     0.7     0     0         3
## 640            115 FALSE     0.0     1     1         3
## 641            122  TRUE     0.0     2     1         3
## 642            120 FALSE     0.0     1     1         3
## 643            120 FALSE     1.5     2     4         3
## 644            135  TRUE     1.0     1     2         3
## 646            155 FALSE     0.0     1     2         3
## 647            110  TRUE     3.0     1     3         3
## 648            150 FALSE     1.5     1     1         3
## 649            150 FALSE     0.2     2     0         3
## 650            155 FALSE     0.1     1     1         3
## 651             92 FALSE     0.3     2     4         3
## 652             83 FALSE     0.0     1     2         3
## 653             97 FALSE     0.0     1     0         3
## 654             98 FALSE    -2.0     1     2         3
## 655             99 FALSE     0.0     1     2         3
## 656             82 FALSE    -1.0     2     1         3
## 657            100  TRUE    -1.0     0     2         3
## 658            148 FALSE     0.0     1     2         3
## 659            103  TRUE     1.0     1     3         3
## 660            121  TRUE     1.8     2     1         3
## 661            148 FALSE     0.3     1     1         3
## 662            131  TRUE     1.4     2     3         3
## 663            120  TRUE     2.0     1     2         3
## 664            100  TRUE     0.0     1     3         3
## 665             98  TRUE     2.0     1     2         3
## 666            182 FALSE     0.7     0     1         3
## 667            138 FALSE     0.5     2     1         3
## 668            100  TRUE     1.0     1     4         3
## 669            105  TRUE     0.0     1     1         3
## 670            175 FALSE     2.6     1     1         3
## 671             94 FALSE     0.0     1     3         3
## 672            115 FALSE     0.0     1     2         3
## 673            119  TRUE     0.9     2     1         3
## 674            115  TRUE     1.0     1     1         3
## 675            143 FALSE     2.4     2     1         4
## 676            110 FALSE     0.1     2     3         4
## 677            130  TRUE     1.1     0     1         4
## 678             63  TRUE     0.5     2     3         4
## 679             99  TRUE     0.5     1     3         4
## 680            149 FALSE     0.4     1     1         4
## 681             70 FALSE     0.0     2     3         4
## 682            110  TRUE     1.5     2     1         4
## 683            113 FALSE     1.4     1     1         4
## 684            105  TRUE     0.0     0     3         4
## 685            115 FALSE     0.0     1     3         4
## 686             77 FALSE     2.5     1     3         4
## 687            105  TRUE     0.0     1     1         4
## 688            117  TRUE     2.0     1     2         4
## 689            145 FALSE     1.0     1     2         4
## 690            143 FALSE     0.0     1     2         4
## 691            123  TRUE     1.7     0     1         4
## 692            134 FALSE    -0.8     1     1         4
## 693             72  TRUE     0.0     1     3         4
## 694            128  TRUE     2.5     0     2         4
## 695             72  TRUE    -0.5     1     1         4
## 696             78 FALSE     2.0     1     3         4
## 697            109 FALSE    -0.9     1     1         4
## 698            149 FALSE     2.0     2     2         4
## 699             86  TRUE     2.0     1     3         4
## 700            154 FALSE     3.7     2     3         4
## 701             98  TRUE     0.0     2     1         4
## 702            140  TRUE     1.0     1     3         4
## 703            145 FALSE     1.1     0     1         4
## 704            114  TRUE     1.3     0     1         4
## 705             93  TRUE     0.0     1     1         4
## 706             67 FALSE     0.0     1     3         4
## 707            154 FALSE     1.0     2     0         4
## 708            122 FALSE     1.5     0     3         4
## 709             90 FALSE     0.0     1     1         4
## 710            108  TRUE     2.0     1     3         4
## 711            125 FALSE     0.0     1     2         4
## 712            120  TRUE     0.0     2     3         4
## 713            136 FALSE     1.8     2     1         4
## 714            130 FALSE     0.0     1     1         4
## 715             92  TRUE     0.0     1     1         5
## 716            157  TRUE     2.0     1     3         5
## 717            114 FALSE     1.6     1     0         5
## 718            121 FALSE     0.0     2     1         5
## 719            123 FALSE     1.3     2     1         5
## 725            120 FALSE    -0.5     2     0         4
## 727            140  TRUE     1.5     1     0         4
## 728            141  TRUE     2.0     2     3         4
## 731            117  TRUE     1.0     1     1         3
## 732            120  TRUE     0.5     1     1         4
## 733            148  TRUE     1.0     1     1         3
## 737            120 FALSE     1.5     0     0         4
## 738            118  TRUE     0.0     1     0         4
## 739            124 FALSE     1.0     2     2         3
## 740            106 FALSE     2.0     1     1         4
## 741            111  TRUE     0.0     1     3         4
## 742            180 FALSE     3.0     2     2         3
## 743            129 FALSE     1.0     1     2         4
## 749            110  TRUE     2.0     0     2         3
## 750            122  TRUE     1.0     1     3         4
## 751            133  TRUE     2.0     2     0         4
## 752            131  TRUE     1.5     0     0         3
## 753             80  TRUE     0.0     1     3         4
## 754            165 FALSE     0.0     1     1         2
## 755             86 FALSE     1.5     1     3         4
## 756            111 FALSE     0.0     1     0         2
## 757             84  TRUE     0.0     2     2         4
## 759            107 FALSE     0.0     1     0         3
## 760            128  TRUE     2.0     0     2         3
## 762            125  TRUE     2.5     1     1         3
## 765            106 FALSE     0.0     2     1         3
## 766            130  TRUE     3.0     1     2         4
## 767            140  TRUE     1.5     1     2         3
## 768            120  TRUE     0.0     1     1         3
## 769            108  TRUE     2.0     0     2         4
## 771            123  TRUE     2.5     1     4         3
## 772            110  TRUE     2.0     1     2         3
## 773            118 FALSE     2.5     0     2         4
## 775            106  TRUE     1.5     0     1         3
## 781            145 FALSE     3.0     1     2         2
## 794            150 FALSE     1.5     0     3         4
## 798            120 FALSE     0.5     2     1         2
## 799            160 FALSE     3.0     1     0         4
## 802            110  TRUE     1.0     0     1         2
# Let's study the percentage of observations that have 0 as chol
count_chol_0 = sum(heart$chol == 0)
total_observations = nrow(heart)
percentage_chol_0 = (count_chol_0 / total_observations) * 100
cat("Percentage of rows with chol equal to 0:", 
    percentage_chol_0, "%\n")
## Percentage of rows with chol equal to 0: 19.21296 %
# For the max_heart_rate we will see if our assumption was true
g5 = ggplot(data = heart) + aes(x = max_heart_rate, color = "orange") + 
     geom_density(alpha = 0.5, fill = "orange") +
     labs(title = "Distribution of the Max Heart Rate", 
          xlab = "Max Heart Rate")
g5

# We see that it tends to be a normal distribution skewed, so because of
# this, the IQR and the 3-sigma rule, we will not delete any values

# We will procceed the same for oldpeak
g6 = ggplot(data = heart) + aes(x = oldpeak, color = "navy") +
     geom_density(alpha = 0.5, fill = "navy")+
     labs(title = "Distribution of the Oldpeak", 
          xlab = "Oldpeak")
g6

# We see that our assumption was right, we do not need to eliminate
# oldpeak outliers
summary(heart)
##       age        sex     chest_pain rest_blood_pressure      chol      
##  Min.   :28.00   0:671   0:469      Min.   : 92.0       Min.   :  0.0  
##  1st Qu.:46.00   1:193   1: 42      1st Qu.:120.0       1st Qu.:175.0  
##  Median :54.00           2:168      Median :130.0       Median :223.0  
##  Mean   :53.13           3:185      Mean   :132.3       Mean   :195.8  
##  3rd Qu.:60.00                      3rd Qu.:140.0       3rd Qu.:265.0  
##  Max.   :77.00                      Max.   :200.0       Max.   :394.0  
##  blood_sugar     restecg max_heart_rate    exang            oldpeak      
##  Mode :logical   0:661   Min.   : 60.0   Mode :logical   Min.   :-2.600  
##  FALSE:738       1:  0   1st Qu.:120.0   FALSE:528       1st Qu.: 0.000  
##  TRUE :126       2:203   Median :140.0   TRUE :336       Median : 0.500  
##                          Mean   :137.6                   Mean   : 0.876  
##                          3rd Qu.:157.2                   3rd Qu.: 1.500  
##                          Max.   :202.0                   Max.   : 6.200  
##  slope   stage   ageGroups
##  0: 73   0:391   0:  4    
##  1:447   1:252   1: 75    
##  2:344   2:102   2:209    
##          3: 93   3:353    
##          4: 26   4:196    
##                  5: 27

Checkpoint -> so far our handling of the outliers has fulfilled the following things stated before:

  • rest_blood_pressure -> values below 90 (still not values above 180 which are 20).

  • chol -> values above 400 have been substituted by the mean of its corresponding age group. However the values that are 0 (more than 160 values) have still not been taken care of.

  • max_heart_rate and oldpeak -> our assumption that the possible few outliers did not affect the distribution of both features was right. Therefore they were not deleted.

To sum up, the last outliers to analyse are the rest_blood_pressure values above 180 and the chol values that are 0.

# First we will look for the values that are both 0 in chol and above
# 180 in rest_blood_pressure
common_indeces = which(heart$chol == 0 & 
                         heart$rest_blood_pressure >= 180)
common_indeces
## [1] 647 680 700 701 726
# We delete those observations
heart = heart[- common_indeces, ]
summary(heart)
##       age        sex     chest_pain rest_blood_pressure      chol      
##  Min.   :28.00   0:667   0:466      Min.   : 92         Min.   :  0.0  
##  1st Qu.:46.00   1:192   1: 42      1st Qu.:120         1st Qu.:176.5  
##  Median :54.00           2:168      Median :130         Median :224.0  
##  Mean   :53.08           3:183      Mean   :132         Mean   :197.0  
##  3rd Qu.:60.00                      3rd Qu.:140         3rd Qu.:265.0  
##  Max.   :77.00                      Max.   :200         Max.   :394.0  
##  blood_sugar     restecg max_heart_rate    exang            oldpeak       
##  Mode :logical   0:657   Min.   : 60.0   Mode :logical   Min.   :-2.6000  
##  FALSE:734       1:  0   1st Qu.:120.0   FALSE:526       1st Qu.: 0.0000  
##  TRUE :125       2:202   Median :140.0   TRUE :333       Median : 0.5000  
##                          Mean   :137.7                   Mean   : 0.8765  
##                          3rd Qu.:158.0                   3rd Qu.: 1.5000  
##                          Max.   :202.0                   Max.   : 6.2000  
##  slope   stage   ageGroups
##  0: 73   0:390   0:  4    
##  1:444   1:250   1: 75    
##  2:342   2:102   2:209    
##          3: 91   3:352    
##          4: 26   4:192    
##                  5: 27
# Let's check how many outliers for chol and rest_blood_pressure
# are left
sum(heart$chol == 0)                  # 159 left
## [1] 161
sum(heart$rest_blood_pressure >= 180) # 15 left
## [1] 15
# We will study to which range of ages belong this variables
max(heart$age[which(heart$chol == 0)])
## [1] 74
min(heart$age[which(heart$chol == 0)])
## [1] 32
max(heart$age[which(heart$rest_blood_pressure >= 180)])
## [1] 68
min(heart$age[which(heart$rest_blood_pressure >= 180)])
## [1] 39
# We see both variables cover almost the same ranges
heart[which(heart$chol == 0), ]
heart[which(heart$rest_blood_pressure >= 180), ]
# We see that the values at the "frontier" are the wide majority
# So the value strictly greater are
sum(heart$rest_blood_pressure > 180)
## [1] 5
# We will delete those observations and keep the rest 180 values at
# the "frontier"
indeces_delete = which(heart$rest_blood_pressure > 180)
heart = heart[- indeces_delete, ]

summary(heart)
##       age        sex     chest_pain rest_blood_pressure      chol      
##  Min.   :28.00   0:663   0:463      Min.   : 92.0       Min.   :  0.0  
##  1st Qu.:46.00   1:191   1: 42      1st Qu.:120.0       1st Qu.:175.2  
##  Median :54.00           2:166      Median :130.0       Median :223.0  
##  Mean   :53.09           3:183      Mean   :131.6       Mean   :196.6  
##  3rd Qu.:60.00                      3rd Qu.:140.0       3rd Qu.:265.0  
##  Max.   :77.00                      Max.   :180.0       Max.   :394.0  
##  blood_sugar     restecg max_heart_rate    exang            oldpeak       
##  Mode :logical   0:655   Min.   : 60.0   Mode :logical   Min.   :-2.6000  
##  FALSE:731       1:  0   1st Qu.:120.0   FALSE:524       1st Qu.: 0.0000  
##  TRUE :123       2:199   Median :140.0   TRUE :330       Median : 0.5000  
##                          Mean   :137.6                   Mean   : 0.8722  
##                          3rd Qu.:158.0                   3rd Qu.: 1.5000  
##                          Max.   :202.0                   Max.   : 6.2000  
##  slope   stage   ageGroups
##  0: 71   0:389   0:  4    
##  1:442   1:248   1: 74    
##  2:341   2:102   2:209    
##          3: 90   3:349    
##          4: 25   4:191    
##                  5: 27
# In the end, we will procceed with the remaining 0s in chol by the
# mean of group of age; remember the ages covered from 32 to 76
last_indeces = which(heart$chol == 0)
for (i in 1:length(last_indeces)){
  to_use = last_indeces[i]
  if (heart$ageGroups[to_use] == 1){
    heart$chol[to_use] = mean1
  }else if (heart$ageGroups[to_use] == 2){
    heart$chol[to_use] = mean2
  }else if (heart$ageGroups[to_use] == 3){
    heart$chol[to_use] = mean3
  }else{
    heart$chol[to_use] = mean4
  }
}
summary(heart)
##       age        sex     chest_pain rest_blood_pressure      chol      
##  Min.   :28.00   0:663   0:463      Min.   : 92.0       Min.   : 85.0  
##  1st Qu.:46.00   1:191   1: 42      1st Qu.:120.0       1st Qu.:216.0  
##  Median :54.00           2:166      Median :130.0       Median :245.0  
##  Mean   :53.09           3:183      Mean   :131.6       Mean   :242.8  
##  3rd Qu.:60.00                      3rd Qu.:140.0       3rd Qu.:265.0  
##  Max.   :77.00                      Max.   :180.0       Max.   :394.0  
##  blood_sugar     restecg max_heart_rate    exang            oldpeak       
##  Mode :logical   0:655   Min.   : 60.0   Mode :logical   Min.   :-2.6000  
##  FALSE:731       1:  0   1st Qu.:120.0   FALSE:524       1st Qu.: 0.0000  
##  TRUE :123       2:199   Median :140.0   TRUE :330       Median : 0.5000  
##                          Mean   :137.6                   Mean   : 0.8722  
##                          3rd Qu.:158.0                   3rd Qu.: 1.5000  
##                          Max.   :202.0                   Max.   : 6.2000  
##  slope   stage   ageGroups
##  0: 71   0:389   0:  4    
##  1:442   1:248   1: 74    
##  2:341   2:102   2:209    
##          3: 90   3:349    
##          4: 25   4:191    
##                  5: 27
# New boxplot where we have "new" outliers that are not errors and make
# sense
g7 = ggplot(data = heart) + aes(x = heart$chol) + 
     geom_boxplot(fill = "lightblue", color = "blue", 
                  outlier.color = "red", outlier.shape = 16) +
     geom_jitter(aes(y = 0), alpha = 0.5) + theme_minimal() +
     xlab("Cholesterol")

# This boxplot of the variable chol have a few outliers which actually make sense
g7

We have finally have our data ready to go. However, normalising and standarising it could be a great idea to compare the variability and the results between features.

1.4 Normalization:

We need to normalize the numerical variables in order to have the same scale, without losing any information.

# We save the data set before normalizing for future plots
heart_not_norm <- heart

heart$age <- (heart$age - min(heart$age))/(max(heart$age) - min(heart$age))

heart$rest_blood_pressure <- (heart$rest_blood_pressure - min(heart$rest_blood_pressure))/(max(heart$rest_blood_pressure) - min(heart$rest_blood_pressure))

heart$chol <- (heart$chol - min(heart$chol))/(max(heart$chol) - min(heart$chol))

heart$max_heart_rate <- (heart$max_heart_rate - min(heart$max_heart_rate))/(max
(heart$max_heart_rate) - min(heart$max_heart_rate))

heart$oldpeak <- (heart$oldpeak - min(heart$oldpeak))/(max(heart$oldpeak) - min(heart$oldpeak))

summary(heart)
##       age         sex     chest_pain rest_blood_pressure      chol       
##  Min.   :0.0000   0:663   0:463      Min.   :0.0000      Min.   :0.0000  
##  1st Qu.:0.3673   1:191   1: 42      1st Qu.:0.3182      1st Qu.:0.4239  
##  Median :0.5306           2:166      Median :0.4318      Median :0.5178  
##  Mean   :0.5120           3:183      Mean   :0.4499      Mean   :0.5106  
##  3rd Qu.:0.6531                      3rd Qu.:0.5455      3rd Qu.:0.5825  
##  Max.   :1.0000                      Max.   :1.0000      Max.   :1.0000  
##  blood_sugar     restecg max_heart_rate     exang            oldpeak      
##  Mode :logical   0:655   Min.   :0.0000   Mode :logical   Min.   :0.0000  
##  FALSE:731       1:  0   1st Qu.:0.4225   FALSE:524       1st Qu.:0.2955  
##  TRUE :123       2:199   Median :0.5634   TRUE :330       Median :0.3523  
##                          Mean   :0.5467                   Mean   :0.3946  
##                          3rd Qu.:0.6901                   3rd Qu.:0.4659  
##                          Max.   :1.0000                   Max.   :1.0000  
##  slope   stage   ageGroups
##  0: 71   0:389   0:  4    
##  1:442   1:248   1: 74    
##  2:341   2:102   2:209    
##          3: 90   3:349    
##          4: 25   4:191    
##                  5: 27

2. Visualization

In addition to the plots we have already created during the feature engineering process, we are going to create some additional ones to gain a deeper understanding of the dataset.

As a first approach for visualizing the data we are going to create a variable that divides the dataset in 2 group (boolean), where true will be that they have any stage of disease (this may be useful for visualization the future results).

heart <- heart %>%
  mutate(heart_disease = heart$stage != 0)

heart$heart_disease = as_factor(heart$heart_disease)
heart_not_norm$heart_disease = as_factor(heart$heart_disease)

sex_plot = factor(heart$sex, levels = c(0, 1),
                      labels = c("Male", "Female"))
chest_pain_plot = factor(heart$chest_pain,
                          labels = c("asymptomatic",
                                     "typical angina",
                                     "atypical angina",
                                     "non-anginal"),
                          levels = c(0, 1, 2, 3))

# Using this new variable will make much easier the visualization of the data
heart_not_norm %>% ggplot(aes(x=heart_disease, y = max_heart_rate)) + geom_boxplot(fill="lightblue") +
  xlab("Heart Disease?") +
  ylab("Max Heart Rate") +
  labs(title = "Heart Disease vs Max Heart Rate ")

People with a low max heart rate tend to have a heart disease.

ggplot(data = heart_not_norm, aes(x = age, y = max_heart_rate, color = sex_plot)) +
  geom_point()+
  geom_smooth()+
  xlab("Age") +
  ylab("Maximum Heart Rate") +
  labs(title = "Relationship Between Max Heart Rate and Age",
       subtitle = "Divided by gender",
       color = "Gender")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

We can see that there is a very soft correlation between these variables, at higher ages the max heart rates become lower. We can also appreciate that

ggplot(data = heart, aes(x = chest_pain, fill = chest_pain_plot)) +
  geom_bar() +
  xlab("Chest Pain") +
  ylab("Count") +
  labs(title = "Number of Observations by Type of Chest Pain",
       fill = "Chest Pain")

The majority of our observations have asymptomatic chest pain.

ggplot(data = heart, aes(x = chest_pain, y = heart_not_norm$max_heart_rate, 
                         color = chest_pain_plot)) +
  geom_boxplot() +
  geom_jitter(size = 1, alpha = 0.2) +
  facet_grid(~ slope, labeller = labeller(slope = c("0" = "Downsloping",
                                                    "1" = "Flat",
                                                    "2" = "Upsloping"))) +
  xlab("Chest Pain") +
  ylab("Max Heart Rate") +
  labs(title = "Impact of Blood Pressure in Appearance of Chest Pain",
       subtitle = "Divided by slope groups",
       color = "Chest Pain")

We can see that the few people that has a “downsloping” slope belong to the asymptomatic chest pain group. Besides, we can see that the “upsloping” group has, in general, a higher heart rate than the other slope groups. Finally, the people with asymptomatic chest pain (group 0) tend to present a lower maximum heart rate with respect to the others.

ggplot(data = heart_not_norm, aes(x = chol)) +
       geom_histogram(binwidth = 30, fill = "blue", color = "black")

The cholesterol is distributed normally.

ggplot(data = heart_not_norm, aes(x = sex, y = age, fill = sex_plot)) +
  geom_violin()+
  geom_boxplot(alpha = 0.2) +
  geom_jitter(alpha = 0.3) +
  facet_grid(.~ exang) +
  xlab("Gender") +
  ylab("Age") +
  labs(title = "Age and Gender depending of the Exang",
       subtitle = "Exang - Exercise induced angina (True / False)",
       fill = "Gender")

Older people tend to have more an exercise induced angina, no matter their gender.

ggplot(data = heart, aes(x = rest_blood_pressure, fill = sex_plot)) +
       geom_density(alpha = 0.5) +
  xlab("Rest Blood Pressure") +
  labs(fill = "Gender",
       title = "Density of Blood Pressure at Rest by Gender")

The blood pressure at rest of a person does not depend on its gender.

ggplot(data = heart, aes(x = blood_sugar, fill = sex_plot)) +
       geom_bar(position = "fill") +
  xlab("Blood Sugar") +
  ylab("Percentage") +
  labs(title = "Gender Percentages by Blood Sugar > 120 mg/dl",
       fill = "Gender")

The values are pretty similar for both groups of blood sugar. However, the majority of women are <120 mg/dl while the majority of men have a blood sugar level of >120 mg/dl.

ggplot(data = heart_not_norm, aes(x = rest_blood_pressure, 
                                  y = max_heart_rate, 
                                  color = heart_disease)) +
       geom_point() +
       facet_wrap(~ sex,  labeller = labeller(sex = c("0" = "Male",
                                                      "1" = "Female"))) + 
  xlab("Rest Blood Pressure") +
  ylab("Max Heart Rate") +
  labs(title = "Rest Blood Pressure vs Max Heart Rate divided by Gender",
       color = "Heart Disease")

We can see a higher range of heart rates for males than for females. Besides, there is no apparent correlation between heart rate and blood pressure.

ggplot(data = heart, aes(x = chest_pain_plot, fill = exang)) +
      geom_bar(position = "dodge") +
      facet_grid(sex ~ .,  labeller = labeller(sex = c("0" = "Male",
                                                      "1" = "Female"))) +
  xlab("Chest Pain Type") +
  ylab("Count") +
  labs(title = "# of People with each Chest Pain Type",
       subtitle = "Divided by Gender and Exercise-induced angina")

There were less meassurements for gender female than for gender male. They seem to have the same tendency overall with more or less the same proportions. The differences of proportions are not noticeable, however, very few females had exercise-induced angina.

ggplot(data = heart_not_norm, aes(x = exang, y = max_heart_rate, fill = slope)) +
       geom_boxplot() +
  xlab("Exercise-Induced Angina")+
  ylab("Max Heart Rate") +
  labs(title= "Heart Rates of the different slope groups divided by exang",
       subtitle = "exang - exercise-induced angina")

The heart rates of all kinds of slope values are higher for the people without an exercise-induced angina. Besides, the group with level 2 slope has the highest max_heart_rate value no matter the exang.

library(corrplot)
## corrplot 0.92 loaded
correlation_matrix <- cor(heart[, sapply(heart, is.numeric)])
corrplot(correlation_matrix, method = "color", tl.cex = 0.55)

We can appreciate that the max_heart_rate is negatively related to all the other variables. We can also highlight that the strongest relationship is between age and max_heart_rate.

library(GGally)

ggpairs(heart, columns = c("age", "max_heart_rate", "chol", "rest_blood_pressure"),
        aes(color = sex), 
        lower = list(geom_smooth(method = "lm"),
        mapping = aes(fill = sex)))

No important relationships obtained.

3. PCA

We are going to start by making a brief descriptive analysis:

Dimension 1: univariate analysis.

heart_pca = heart[,c("age", "rest_blood_pressure", "chol", "max_heart_rate",
                     "oldpeak")]

boxplot(heart_pca, las=2, col="lightblue")

Dimension 2: bivariate analysis.

ggcorr(heart_pca, label = TRUE)

cor_mat = cor(heart_pca)
heatmap(cor_mat)

We can see that there are no big relationships between the features since all the values obtained by the correlation matrix are very low, which will make the analysis more difficult.

We can start with the Principal Components.

pca = prcomp(heart_pca, scale=T)
summary(pca)
## Importance of components:
##                           PC1    PC2    PC3    PC4    PC5
## Standard deviation     1.3103 0.9884 0.9515 0.9051 0.7626
## Proportion of Variance 0.3434 0.1954 0.1811 0.1638 0.1163
## Cumulative Proportion  0.3434 0.5388 0.7198 0.8837 1.0000

We can see that with the 2 first PCs we can only explain a 54% of the variability of the data.

# eigen(cor(heart_numeric))  
# screeplot(pca,main="Screeplot",col="blue",type="barplot",pch=19)

fviz_screeplot(pca, addlabels = TRUE)

3.1 First Principal Component

# First Component
barplot(pca$rotation[,1], las=2, col="#E26D5C")

# Let's try with squared loadings instead
fviz_contrib(pca, choice = "var", axes = 1)

This first component shows us that the most significant fact in the appearance of any sort of heart disease is the age. This is followed by the maximum heart rate, the oldpeak and the blood pressure, which are all around 20%. Finally, the cholesterol does not seem to have a noticeable impact.

Moreover, let us search any further kind of relationship obtained by the PC1 ranking of observations.

# The worst
heart$stage[order(pca$x[,1])][(length(heart)-5):length(heart)]
## [1] 0 3 3 4 3 2
## Levels: 0 1 2 3 4
# The best
heart$stage[order(pca$x[,1])][1:10]
##  [1] 0 3 3 2 2 1 4 1 0 3
## Levels: 0 1 2 3 4

We cannot obtain any useful information just with 1 component related to the disease.

head(get_pca_ind(pca)$contrib[,1]) # this is in %, that is between 0 and 100
##          1          2          3          4          5          6 
## 0.10464184 0.45709259 0.12929655 0.05240512 0.13978004 0.05574121
head((pca$x[,1]^2)/(pca$sdev[1]^2))/dim(heart_pca)[1] # this is between 0 and 1
##            1            2            3            4            5            6 
## 0.0010464184 0.0045709259 0.0012929655 0.0005240512 0.0013978004 0.0005574121
fviz_contrib(pca, choice = "ind", axes = 1, top=100)

3.2 Second Principal Component

# Second Component
barplot(pca$rotation[,2], las=2, col="darkblue")

fviz_contrib(pca, choice = "var", axes = 2)

In this case we obtained that the highest contribution to this Second Principal Component was given by the cholesterol level. The remaining features are almost insignificant. Since rest_blood_pressure, max_heart_rate and cholesterol are positive we understand that having a high cholesterol level did also affect to the other two variables somehow.

We can now plot the PC1 and PC2 to try to obtain some extra information:

biplot(pca)

fviz_pca_var(pca, col.var = "contrib", col.ind = heart$stages)

# fviz_pca_biplot(pca, repel = TRUE)

fviz_pca_biplot(pca, label = "var",
                habillage=heart$heart_disease,
                addEllipses=TRUE,
                ellipse.level=0.95,
                ggtheme = theme_minimal(),
                legend.title = "Health Disease")

We can see that even if both people with and without heart disease are pretty distributed randomly, we can identify a trend. The points that are in the direction of the negative variables of the first principal component (age, old_peak, rest_blood_pressure) tend to represent the people with the heart disease, while the points in the direction of the max_heart_rate tend to be the scores of the healthy group.

highest_ranked = heart$stage[order(get_pca_ind(pca)$contrib[,1],decreasing=T)][1:10]

# percentage of the highest ranked people that are indeed healthy
(sum(highest_ranked == 0)/length(highest_ranked))*100
## [1] 70

We obtained that 70% of the highest ranked people by the PC does not have a heart disease. This may be related to the high influence of max_heart_rate in the PC1 and in having or not a heart disease.

data.frame(z1=-pca$x[,1],z2=pca$x[,2]) %>% 
  ggplot(aes(z1,z2,label= heart$stage, color=heart_not_norm$age)) +
  labs(title="PCA", x="PC1", y="PC2", color = "Age") +
  theme_bw() + 
  scale_color_gradient(low="#FFE1A8", high="#E26D5C")+
  theme(legend.position="bottom") +
  geom_text(size=3, hjust=0.6, vjust=0, check_overlap = TRUE)

It is notable that the age values are very related to the PC1. It makes a lot of sense since the higher contribution to that PC was indeed the age.

data.frame(z1=pca$x[,1],z2=pca$x[,2]) %>% 
  ggplot(aes(z1,z2,label=heart$stage, color=heart_not_norm$chol)) +
  labs(title="PCA", x="PC1", y="PC2", color = "Cholesterol") +
  theme_bw() +
  scale_color_gradient(low="#FFE1A8", high="#E26D5C")+
  theme(legend.position="bottom") +
  geom_text(size=3, hjust=0.6, vjust=0, check_overlap = TRUE)

As we expected, the first principal component does not change with cholesterol but, on the other hand, the PC2 has very different cholesterol values for the positive than for the negative side.

data.frame(z1=pca$x[,1],z2=heart$max_heart_rate) %>% 
  ggplot(aes(z1,z2,label=heart$stage,color=heart_not_norm$rest_blood_pressure)) + 
  labs(title="Performance", x="PC1", y="Max heart rate") +
  scale_color_gradient(low="#FFE1A8", high="#E26D5C") +
  theme_bw() +
  theme(legend.position="bottom") +
  geom_text(size=2, hjust=0.6, vjust=0, check_overlap = TRUE)

Let us plot this in a different way in order to obtain a more visual representation of it.

ggplot()+
  aes(x = pca$x[,1], y = heart_not_norm$max_heart_rate, color = heart$heart_disease) +
  geom_point()+
  geom_smooth(color = "cyan")+
  xlab("PC1")+
  ylab("Max Heart Rate") +
  labs(color = "Heart Disease")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

This plot shows how the max heart rate increases with the First Principal component. Moreover, the values where the max heart rate is higher we obtain a greater number of healthy people. Therefore, heart diseases are related with low heart rates.

3.3 Third Principal Component

# Third Component
barplot(pca$rotation[,3], las=2, col="darkblue")

fviz_contrib(pca, choice = "var", axes = 3)

In this case, the Third Principal Component is predominantly influenced by blood pressure and heart rate. The positive values of rest_blood_pressure, max_heart_rate, and oldpeak in the barplot suggest that this PC primarily relates to the individual’s heartbeat and blood circulation.

df_pca = data.frame(PC1 = pca$x[,1],
                    PC2 = pca$x[,2],
                    PC3 = pca$x[,3])

ggpairs(df_pca, columns = c("PC1", "PC2", "PC3"), 
        aes(color = heart$heart_disease), 
        lower = list(geom_smooth(method = "lm"),
        mapping = aes(fill = heart$heart_disease)))

This plot shows us that the PC1 is the main component to separate the people with a heart disease from the people without it. Besides, since the highest contribution to this component was given by the age, we can conclude that it is the most important variable in the apparition of heart diseases.

4. Factor Analysis

This data set has a peculiar characteristic, which is the amount of categorical variables in detriment of the numerical ones. However, we can take this to our advantage. We know that categorical variables, that are factors, can then be transformed to numerical. Nevertheless, most of the times categorical variables have a huge impact to the results. So, we thought of using 3 auxiliary data sets for the purpose of this analysis. One will have only the numerical features, whereas the other one will be composed of all the possible variables turned into numerical when needed without our target variable (stage) as that would have a huge weight. Also, to clarify our assumptions that the variable stage is taking all the weight of the model, we will do another one with all the variables with no exception.

This way we could compare results between the 2 data sets and also compare with some individual plots of the categorical variables. As it was done in the Lab 5-6 related to clustering.

glimpse(heart)
## Rows: 854
## Columns: 14
## $ age                 <dbl> 0.7142857, 0.7959184, 0.7959184, 0.1836735, 0.2653…
## $ sex                 <fct> 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
## $ chest_pain          <fct> 1, 0, 0, 3, 2, 2, 0, 0, 0, 0, 0, 2, 3, 2, 3, 3, 2,…
## $ rest_blood_pressure <dbl> 0.6022727, 0.7727273, 0.3181818, 0.4318182, 0.4318…
## $ chol                <dbl> 0.4789644, 0.6504854, 0.4660194, 0.5339806, 0.3851…
## $ blood_sugar         <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA…
## $ restecg             <fct> 2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 0, 2, 2, 0, 0, 0, 0,…
## $ max_heart_rate      <dbl> 0.6338028, 0.3380282, 0.4859155, 0.8943662, 0.7887…
## $ exang               <lgl> FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRU…
## $ oldpeak             <dbl> 0.5568182, 0.4659091, 0.5909091, 0.6931818, 0.4545…
## $ slope               <fct> 0, 1, 1, 0, 2, 2, 0, 2, 1, 0, 1, 1, 1, 2, 2, 2, 0,…
## $ stage               <fct> 0, 2, 1, 0, 0, 0, 3, 0, 2, 1, 0, 0, 2, 0, 0, 0, 1,…
## $ ageGroups           <fct> 4, 4, 4, 1, 2, 3, 4, 3, 4, 3, 3, 3, 3, 2, 3, 3, 2,…
## $ heart_disease       <fct> FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, FALS…
# Let's first delete the created variables used before
heart_fa1 = heart %>% select(-heart_disease, - ageGroups)

# we delete the targe variable
heart_fa1 = heart_fa1 %>% select(-stage)

# Data set with all variables (converting to numerical when needed)
for(i in which(sapply(heart_fa1, class) == "factor")) {
  heart_fa1[,i] <- as.numeric(as.character(heart_fa1[,i]))
}

# Then we normalise the variables that were categorical and
# had different values from [0, 1] strictly:
# chest_pain; restcg; slope
heart_fa1$chest_pain = (heart_fa1$chest_pain - 
                          min(heart_fa1$chest_pain)) / 
                        (max(heart_fa1$chest_pain) 
                         - min(heart_fa1$chest_pain))
heart_fa1$restecg = (heart_fa1$restecg - 
                          min(heart_fa1$restecg)) / 
                        (max(heart_fa1$restecg) 
                         - min(heart_fa1$restecg))
heart_fa1$slope = (heart_fa1$slope - 
                          min(heart_fa1$slope)) / 
                        (max(heart_fa1$slope) 
                         - min(heart_fa1$slope))
# Now we have our full numerical data set ready to go

# Data set with just the numerical variables
heart_fa2 = heart[sapply(heart, is.numeric)]

# Factor Analysis for all variables
fa1 = factanal(heart_fa1, factors = 3, rotation="none", 
              scores = "regression")
fa1
## 
## Call:
## factanal(x = heart_fa1, factors = 3, scores = "regression", rotation = "none")
## 
## Uniquenesses:
##                 age                 sex          chest_pain rest_blood_pressure 
##               0.442               0.928               0.745               0.862 
##                chol         blood_sugar             restecg      max_heart_rate 
##               0.963               0.906               0.887               0.005 
##               exang             oldpeak               slope 
##               0.585               0.511               0.644 
## 
## Loadings:
##                     Factor1 Factor2 Factor3
## age                 -0.374   0.418   0.493 
## sex                  0.185  -0.114   0.158 
## chest_pain           0.328  -0.312   0.223 
## rest_blood_pressure -0.104   0.283   0.217 
## chol                         0.140   0.120 
## blood_sugar                  0.164   0.252 
## restecg              0.108   0.243   0.206 
## max_heart_rate       0.997                 
## exang               -0.367   0.461  -0.261 
## oldpeak             -0.162   0.658  -0.172 
## slope                0.385  -0.420   0.179 
## 
##                Factor1 Factor2 Factor3
## SS loadings      1.614   1.293   0.615
## Proportion Var   0.147   0.118   0.056
## Cumulative Var   0.147   0.264   0.320
## 
## Test of the hypothesis that 3 factors are sufficient.
## The chi square statistic is 71.4 on 25 degrees of freedom.
## The p-value is 2.38e-06
cbind(fa1$loadings, fa1$uniquenesses)
##                         Factor1      Factor2       Factor3          
## age                 -0.37356846  0.418455561  4.929469e-01 0.4423447
## sex                  0.18546801 -0.114440365  1.580143e-01 0.9275381
## chest_pain           0.32761757 -0.312397003  2.232699e-01 0.7452228
## rest_blood_pressure -0.10374497  0.282533676  2.166863e-01 0.8624582
## chol                -0.04857463  0.140387548  1.202523e-01 0.9634767
## blood_sugar         -0.06304669  0.164057516  2.516583e-01 0.9057794
## restecg              0.10819513  0.243109549  2.058704e-01 0.8868075
## max_heart_rate       0.99747615  0.006440379 -7.067179e-05 0.0050000
## exang               -0.36698962  0.460593452 -2.612099e-01 0.5849391
## oldpeak             -0.16198144  0.658000092 -1.719729e-01 0.5112262
## slope                0.38459234 -0.420000816  1.787718e-01 0.6437300
# Factor Analysis for just the numerical variables
fa2 = factanal(heart_fa2, factors = 2, rotation="none", 
              scores = "regression")
fa2
## 
## Call:
## factanal(x = heart_fa2, factors = 2, scores = "regression", rotation = "none")
## 
## Uniquenesses:
##                 age rest_blood_pressure                chol      max_heart_rate 
##               0.569               0.817               0.956               0.005 
##             oldpeak 
##               0.848 
## 
## Loadings:
##                     Factor1 Factor2
## age                 -0.372   0.541 
## rest_blood_pressure -0.103   0.415 
## chol                         0.205 
## max_heart_rate       0.997         
## oldpeak             -0.159   0.357 
## 
##                Factor1 Factor2
## SS loadings      1.172   0.634
## Proportion Var   0.234   0.127
## Cumulative Var   0.234   0.361
## 
## Test of the hypothesis that 2 factors are sufficient.
## The chi square statistic is 0.63 on 1 degree of freedom.
## The p-value is 0.429
cbind(fa2$loadings, fa2$uniquenesses)
##                         Factor1     Factor2          
## age                 -0.37222189 0.541128425 0.5686308
## rest_blood_pressure -0.10294167 0.414851100 0.8173024
## chol                -0.04819274 0.205258458 0.9555460
## max_heart_rate       0.99749392 0.002424049 0.0050000
## oldpeak             -0.15857495 0.356862345 0.8475034

As a first approach, we get a strong difference between using all the variables (except the stage) and using just the strictly numerical.

When we are using only numerical we get poor model with a p-value around 0.7 and see that we will not get strong relationships and a strong explanation of the variability of the data set. As we have seen before in the PCA.

At the time of studying all variables as numerical, except the target variable, we get a strong model, in which all the SS loadings of the factors are above 0.6, which is actually a quite good result.

However, these results are just a first glance. As we are using the function factanal() we need to know what we are doing and how our parameters may affect our study.

In order to see better and clearer the results we will divide from now on between the factor analysis corresponding to each auxiliary data set.

4.1 All variables except stage (FA)

We will start the factor Analysis with the data set heart_fa1 which is composed of all the variables turned into numerical when needed, excluding the feature stage.(Every feature was normalised).

# In short, we have
# heart_fa1 # all(almost)
# heart_fa2 # numerical

# To compare the different results we will follow...
# fa1_X -> factor analisys for the 1st data set
# fa2_X -> factor analisys for the 2nd data set
# Let's start with all the the variables data set
# Changing the scores
fa1_1 = factanal(heart_fa1, factors = 3, rotation= "none", 
              scores = "none")
fa1_1
## 
## Call:
## factanal(x = heart_fa1, factors = 3, scores = "none", rotation = "none")
## 
## Uniquenesses:
##                 age                 sex          chest_pain rest_blood_pressure 
##               0.442               0.928               0.745               0.862 
##                chol         blood_sugar             restecg      max_heart_rate 
##               0.963               0.906               0.887               0.005 
##               exang             oldpeak               slope 
##               0.585               0.511               0.644 
## 
## Loadings:
##                     Factor1 Factor2 Factor3
## age                 -0.374   0.418   0.493 
## sex                  0.185  -0.114   0.158 
## chest_pain           0.328  -0.312   0.223 
## rest_blood_pressure -0.104   0.283   0.217 
## chol                         0.140   0.120 
## blood_sugar                  0.164   0.252 
## restecg              0.108   0.243   0.206 
## max_heart_rate       0.997                 
## exang               -0.367   0.461  -0.261 
## oldpeak             -0.162   0.658  -0.172 
## slope                0.385  -0.420   0.179 
## 
##                Factor1 Factor2 Factor3
## SS loadings      1.614   1.293   0.615
## Proportion Var   0.147   0.118   0.056
## Cumulative Var   0.147   0.264   0.320
## 
## Test of the hypothesis that 3 factors are sufficient.
## The chi square statistic is 71.4 on 25 degrees of freedom.
## The p-value is 2.38e-06
# "regression" -> SS loadings      1.645   1.248   0.608
# "Bartlett"   -> SS loadings      1.645   1.248   0.608
# "none"       -> SS loadings      1.645   1.248   0.608
# We see that the scores parameter in this case does not affect
# the result

# Now we will see how the value of rotation affects our results
fa1_2 = factanal(heart_fa1, factors = 3, rotation= "promax", 
              scores = "none")
fa1_2
## 
## Call:
## factanal(x = heart_fa1, factors = 3, scores = "none", rotation = "promax")
## 
## Uniquenesses:
##                 age                 sex          chest_pain rest_blood_pressure 
##               0.442               0.928               0.745               0.862 
##                chol         blood_sugar             restecg      max_heart_rate 
##               0.963               0.906               0.887               0.005 
##               exang             oldpeak               slope 
##               0.585               0.511               0.644 
## 
## Loadings:
##                     Factor1 Factor2 Factor3
## age                         -0.183   0.708 
## sex                 -0.242                 
## chest_pain          -0.477   0.111         
## rest_blood_pressure                  0.333 
## chol                                 0.178 
## blood_sugar                          0.322 
## restecg                      0.201   0.271 
## max_heart_rate      -0.129   0.918  -0.179 
## exang                0.635                 
## oldpeak              0.704   0.214         
## slope               -0.540   0.113         
## 
##                Factor1 Factor2 Factor3
## SS loadings      1.505   1.001   0.874
## Proportion Var   0.137   0.091   0.079
## Cumulative Var   0.137   0.228   0.307
## 
## Factor Correlations:
##         Factor1 Factor2 Factor3
## Factor1  1.0000   0.320 -0.0347
## Factor2  0.3201   1.000 -0.3764
## Factor3 -0.0347  -0.376  1.0000
## 
## Test of the hypothesis that 3 factors are sufficient.
## The chi square statistic is 71.4 on 25 degrees of freedom.
## The p-value is 2.38e-06
# None ->    SS loadings      1.645   1.248   0.608 -> 3.501
# Varimax -> SS loadings      1.599   0.975   0.926 -> 3.500
# promax ->  SS loadings      1.469   0.979   0.899 -> 3.347
# We see that there is practically no difference between using
# varimax or no rotation. The relevant conclusion is to not use
# promax (in this case)

# Then, we will see which number of factors is the best
nFactors::plotnScree(nFactors::nScree(x = eigen(cor(heart_fa1))$values),legend = F, main = "Optimal Amount of Factors (Elbow Method)", xlab = "Number of Factors")
abline(h = 0.90, col = "purple", lty = 6)

# By the Elbow Method we see the optimal amount of factors is between 4 and 5
# Let's compare
fa1_3 = factanal(heart_fa1, factors = 4, rotation= "none", 
              scores = "none")
fa1_3
## 
## Call:
## factanal(x = heart_fa1, factors = 4, scores = "none", rotation = "none")
## 
## Uniquenesses:
##                 age                 sex          chest_pain rest_blood_pressure 
##               0.447               0.923               0.698               0.852 
##                chol         blood_sugar             restecg      max_heart_rate 
##               0.929               0.906               0.885               0.005 
##               exang             oldpeak               slope 
##               0.479               0.566               0.396 
## 
## Loadings:
##                     Factor1 Factor2 Factor3 Factor4
## age                 -0.374   0.336   0.548         
## sex                  0.186  -0.125   0.126  -0.103 
## chest_pain           0.328  -0.326   0.151  -0.256 
## rest_blood_pressure -0.104   0.237   0.273         
## chol                         0.109   0.162   0.176 
## blood_sugar                  0.130   0.269         
## restecg              0.108   0.212   0.238         
## max_heart_rate       0.997                         
## exang               -0.368   0.506  -0.184   0.308 
## oldpeak             -0.163   0.637                 
## slope                0.386  -0.547   0.193   0.343 
## 
##                Factor1 Factor2 Factor3 Factor4
## SS loadings      1.617   1.326   0.642   0.329
## Proportion Var   0.147   0.121   0.058   0.030
## Cumulative Var   0.147   0.268   0.326   0.356
## 
## Test of the hypothesis that 4 factors are sufficient.
## The chi square statistic is 30.9 on 17 degrees of freedom.
## The p-value is 0.0205
fa1_4 = factanal(heart_fa1, factors = 5, rotation= "none", 
              scores = "none")
fa1_4
## 
## Call:
## factanal(x = heart_fa1, factors = 5, scores = "none", rotation = "none")
## 
## Uniquenesses:
##                 age                 sex          chest_pain rest_blood_pressure 
##               0.459               0.319               0.698               0.849 
##                chol         blood_sugar             restecg      max_heart_rate 
##               0.893               0.885               0.884               0.005 
##               exang             oldpeak               slope 
##               0.492               0.545               0.454 
## 
## Loadings:
##                     Factor1 Factor2 Factor3 Factor4 Factor5
## age                 -0.374   0.249   0.331   0.479         
## sex                  0.187  -0.473   0.635  -0.135         
## chest_pain           0.328  -0.348           0.140  -0.230 
## rest_blood_pressure -0.104   0.190   0.201   0.238         
## chol                                 0.189   0.101   0.238 
## blood_sugar                  0.129           0.295         
## restecg              0.108   0.167   0.176   0.208         
## max_heart_rate       0.997                                 
## exang               -0.368   0.494   0.125  -0.198   0.271 
## oldpeak             -0.163   0.578   0.285  -0.107         
## slope                0.386  -0.447  -0.223   0.221   0.313 
## 
##                Factor1 Factor2 Factor3 Factor4 Factor5
## SS loadings      1.617   1.268   0.772   0.564   0.295
## Proportion Var   0.147   0.115   0.070   0.051   0.027
## Cumulative Var   0.147   0.262   0.332   0.384   0.411
## 
## Test of the hypothesis that 5 factors are sufficient.
## The chi square statistic is 11.75 on 10 degrees of freedom.
## The p-value is 0.302
# 4 Factors -> SS loadings 1.612   1.289   0.638   0.601
# 5 Factors -> SS loadings 1.544   1.486   0.676   0.573   0.254
# We can discard the last factors for fa1_4
# The best option is 4 Factors
fa1_best = fa1_3
fa1_best
## 
## Call:
## factanal(x = heart_fa1, factors = 4, scores = "none", rotation = "none")
## 
## Uniquenesses:
##                 age                 sex          chest_pain rest_blood_pressure 
##               0.447               0.923               0.698               0.852 
##                chol         blood_sugar             restecg      max_heart_rate 
##               0.929               0.906               0.885               0.005 
##               exang             oldpeak               slope 
##               0.479               0.566               0.396 
## 
## Loadings:
##                     Factor1 Factor2 Factor3 Factor4
## age                 -0.374   0.336   0.548         
## sex                  0.186  -0.125   0.126  -0.103 
## chest_pain           0.328  -0.326   0.151  -0.256 
## rest_blood_pressure -0.104   0.237   0.273         
## chol                         0.109   0.162   0.176 
## blood_sugar                  0.130   0.269         
## restecg              0.108   0.212   0.238         
## max_heart_rate       0.997                         
## exang               -0.368   0.506  -0.184   0.308 
## oldpeak             -0.163   0.637                 
## slope                0.386  -0.547   0.193   0.343 
## 
##                Factor1 Factor2 Factor3 Factor4
## SS loadings      1.617   1.326   0.642   0.329
## Proportion Var   0.147   0.121   0.058   0.030
## Cumulative Var   0.147   0.268   0.326   0.356
## 
## Test of the hypothesis that 4 factors are sufficient.
## The chi square statistic is 30.9 on 17 degrees of freedom.
## The p-value is 0.0205

After getting the best model, 4-factor with neither scores nor rotation, we will interpret the results.

# Auxiliary data frame to store all the the loadings per factor
factors1 = data.frame(Variable = rownames(fa1_best$loadings),
                      Factor1 = fa1_best$loadings[, 1],
                      Factor2 = fa1_best$loadings[, 2],
                      Factor3 = fa1_best$loadings[, 3],
                      Factor4 = fa1_best$loadings[, 4])
# Now having a column that indicates the factor to which the 
# loading belongs to
loadings_fa1 = tidyr::gather(factors1, Factor, 
                             Loading, -Variable)
# Final barplot with all the Factors and their loadings
gFactors1 = ggplot(data = loadings_fa1, aes(x = Variable, 
                                            y = Loading,
                                            fill = Factor)) +
            geom_bar(stat = "identity") +
            labs(title = "Factor Analysis Loadings",
                 subtitle = "All variables except stage",
                 x = "Variable", 
                 y = "Loadings") + theme_minimal() + 
            theme(axis.text.x = element_text(angle = 45, 
                                             hjust = 1)) +
            facet_wrap(~Factor, scales = "free")
gFactors1

From this first Factor Analysis procedure we see 4 factors, that show us 4 different tendencies of patients. All of the factors are not defined by the cholesterol, it is a variable that does not give us much information about the patient.

  • Factor 1 -> shows us patients that its heart behavior can be explained mainly by the slope.

    The age and the max_heart_rate are negatively correlated, as we expected before. But giving more importance to the max_heart_rate, this observation is quite relevant as in the data set we had much less observations of young people than older people. In addition to the low influence of the sex, it has influence but just a small percentage. Therefore, it is fair to assume that this factor is explaining younger patients.

    With exang and oldpeak we have a positive correlation between them (negative in overall), which makes sense: if you do not show discomfort while exercising (exang) then the normal thing would be to not show any kind of heart depression (oldpeak).

    From these statements we can deduce that this factor is focused on the tendency of healthy (or usual) young patients, as their correlations follow what a healthy (or usual for what a younger) person would get, therefore the weight goes to the slope in which the original value 1 meant no main problems, whereas 0 and 2 meant heart consequences.

  • Factor 2 -> shows directly that the higher the age the lower the max_heart_rate, most of the weight of these patients stays between that duality. But sharing importance with the feature exang.

    Another duality seen, but with a positive relationship is chest_pain and exang. In this case the relationships shows more weight to exang, this relationship is coherent as if an individual experienced some discomfort while exercising exang would be True and the chest_pain would be in between 0, 1 or 2 but no 3. The features oldpeak, rest_blood_pressure and sex follow adequate results.

    Due to the positive relevance of the age, other features related to the appearance of heart issues and the dualities mentioned, we can conclude that this factor explains older patients in moderate heart risks.

  • Factor 3 -> the main thing to mention is an overall positively correlation between all the features (excluding the ones with no relevance). In this factor it can be deduced simply that it shows the patients that have a tendency of having big max_heart_rate, oldpeak, rest_blood_pressure, which are usually older patients but the age does not have an outstanding importance. Due to this high importance on values that when positive affect badly the heart’s health of the patient we deduce that this factor describes the tendency of patients with high heart risks.

  • Factor 4 -> in this factor we see the usual behaviour between age and max_heart_rate. However, here the max_heart_rate is not decreasing as strong as it does the age, which is a good indicator of heart’s health.

    We see lots of indicators of a healthy heart, not experiencing discomfort while exercising while increasing age (exang), almost no effect of increasing age on oldpeak. Also the rest of the features have adequate values. Therefore we can conclude that this factor is explaining the tendency of overall healthy patients.

4.2 Strictly numerical features (FA)

The next Factor Analysis procedure will be focused on the auxiliary data set heart_fa2, which is composed with the strictly numerical variables (just 5).

# Our data set to work is heart_fa2
fa2 = factanal(heart_fa2, factors = 2, rotation = "promax", 
              scores = "none")
fa2
## 
## Call:
## factanal(x = heart_fa2, factors = 2, scores = "none", rotation = "promax")
## 
## Uniquenesses:
##                 age rest_blood_pressure                chol      max_heart_rate 
##               0.569               0.817               0.956               0.005 
##             oldpeak 
##               0.848 
## 
## Loadings:
##                     Factor1 Factor2
## age                 -0.107   0.610 
## rest_blood_pressure          0.453 
## chol                         0.224 
## max_heart_rate       0.966         
## oldpeak                      0.395 
## 
##                Factor1 Factor2
## SS loadings      0.956   0.790
## Proportion Var   0.191   0.158
## Cumulative Var   0.191   0.349
## 
## Factor Correlations:
##         Factor1 Factor2
## Factor1   1.000  -0.366
## Factor2  -0.366   1.000
## 
## Test of the hypothesis that 2 factors are sufficient.
## The chi square statistic is 0.63 on 1 degree of freedom.
## The p-value is 0.429
# Let's see the optimal number of factors
nFactors::plotnScree(nFactors::nScree(x = eigen(cor(heart_fa2))$values),legend = F, main = "Optimal Amount of Factors (Elbow Method)", xlab = "Number of Factors")

# 2 factors
# Then study scores and rotation
# scores
# regression -> SS loadings      1.118   0.327
# Bartlett   -> SS loadings      1.118   0.327
# none       -> SS loadings      1.118   0.327
# Rotation
# none       -> SS loadings      1.118   0.327 -> 1.445
# varimax    -> SS loadings      0.866   0.578 -> 1.444
# promax     -> SS loadings      0.936   0.494 -> 1.43
# Between varimax and none we will use the one that helps us more to visualise
fa2_best = factanal(heart_fa2, factors = 2, 
                    rotation = "varimax", scores = "none")

We will study graphically first and analyse the results.

# Plotting (same as before but with less factors)
# Auxiliary data frame to store all the the loadings per factor
factors2 = data.frame(Variable = rownames(fa2_best$loadings),
                      Factor1 = fa2_best$loadings[, 1],
                      Factor2 = fa2_best$loadings[, 2])
# Now having a column that indicates the factor to which the 
# loading belongs to
loadings_fa2 = tidyr::gather(factors2, Factor, 
                             Loading, -Variable)
# Final barplot with all the Factors and their loadings
gFactors2 = ggplot(data = loadings_fa2, aes(x = Variable, 
                                            y = Loading,
                                            fill = Factor)) +
            geom_bar(stat = "identity") +
            labs(title = "Factor Analysis Loadings",
                 subtitle = "Just Numerical Variables",
                 x = "Variable", 
                 y = "Loadings") + theme_minimal() + 
            theme(axis.text.x = element_text(angle = 45, 
                                             hjust = 1)) +
            facet_wrap(~Factor)
gFactors2

From this second Factor Analysis procedure we see 2 factors, that show us 2 different tendencies of patients.

  • Factor 1 -> we see that the higher the age the lower the max_heart_rate, which is a coherent relation. We also see positive relationship with oldpeak and rest_blood_pressure, which is an adequate result knowing that when you get older the usual thing is to be more prone to those kind of problems. We conclude that this factor explains the usual heart’s health risk of a patient.

  • Factor 2 -> explains the usual relationship between age and max_heart_rate, but very attenuated. In contrast to the importance of rest_blood_pressure. This results makes us think that this factor is focused more on unusual health risk on healthy patients. However this factor is not as robust as the one before, as its SS loadings are 0.578.

4.3 All variables (FA)

Applying what we have learned before, we will now create a model but for all the variables including the stage.

heart_fa3 <- heart %>% select(-heart_disease, - ageGroups)

for(i in which(sapply(heart_fa3, class) == "factor")) {
  heart_fa3[,i] <- as.numeric(as.character(heart_fa3[,i]))
}
glimpse(heart_fa3)
## Rows: 854
## Columns: 12
## $ age                 <dbl> 0.7142857, 0.7959184, 0.7959184, 0.1836735, 0.2653…
## $ sex                 <dbl> 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
## $ chest_pain          <dbl> 1, 0, 0, 3, 2, 2, 0, 0, 0, 0, 0, 2, 3, 2, 3, 3, 2,…
## $ rest_blood_pressure <dbl> 0.6022727, 0.7727273, 0.3181818, 0.4318182, 0.4318…
## $ chol                <dbl> 0.4789644, 0.6504854, 0.4660194, 0.5339806, 0.3851…
## $ blood_sugar         <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA…
## $ restecg             <dbl> 2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 0, 2, 2, 0, 0, 0, 0,…
## $ max_heart_rate      <dbl> 0.6338028, 0.3380282, 0.4859155, 0.8943662, 0.7887…
## $ exang               <lgl> FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRU…
## $ oldpeak             <dbl> 0.5568182, 0.4659091, 0.5909091, 0.6931818, 0.4545…
## $ slope               <dbl> 0, 1, 1, 0, 2, 2, 0, 2, 1, 0, 1, 1, 1, 2, 2, 2, 0,…
## $ stage               <dbl> 0, 2, 1, 0, 0, 0, 3, 0, 2, 1, 0, 0, 2, 0, 0, 0, 1,…
# Then we normalise the variables that were categorical and
# had different values from [0, 1] strictly:
# chest_pain; restcg; slope and stage
heart_fa3$chest_pain = (heart_fa3$chest_pain - 
                          min(heart_fa3$chest_pain)) / 
                        (max(heart_fa3$chest_pain) 
                         - min(heart_fa3$chest_pain))
heart_fa3$restecg = (heart_fa3$restecg - 
                          min(heart_fa3$restecg)) / 
                        (max(heart_fa3$restecg) 
                         - min(heart_fa3$restecg))
heart_fa3$slope = (heart_fa3$slope - 
                          min(heart_fa3$slope)) / 
                        (max(heart_fa3$slope) 
                         - min(heart_fa3$slope))
heart_fa3$stage = (heart_fa3$stage - 
                          min(heart_fa3$stage)) / 
                        (max(heart_fa3$stage) 
                         - min(heart_fa3$stage))

Now we have our full numerical data set ready to go.

# The next step is build the proper Factor model
fa3_1 = factanal(heart_fa3, factors = 3, rotation = "none", 
              scores = "Bartlett")
fa3_1
## 
## Call:
## factanal(x = heart_fa3, factors = 3, scores = "Bartlett", rotation = "none")
## 
## Uniquenesses:
##                 age                 sex          chest_pain rest_blood_pressure 
##               0.416               0.903               0.703               0.875 
##                chol         blood_sugar             restecg      max_heart_rate 
##               0.966               0.908               0.884               0.005 
##               exang             oldpeak               slope               stage 
##               0.602               0.537               0.674               0.515 
## 
## Loadings:
##                     Factor1 Factor2 Factor3
## age                 -0.374   0.415   0.521 
## sex                  0.186  -0.171   0.181 
## chest_pain           0.329  -0.362   0.240 
## rest_blood_pressure -0.104   0.249   0.227 
## chol                         0.130   0.123 
## blood_sugar                  0.171   0.242 
## restecg              0.108   0.251   0.204 
## max_heart_rate       0.997                 
## exang               -0.368   0.453  -0.239 
## oldpeak             -0.163   0.646  -0.141 
## slope                0.385  -0.395   0.147 
## stage               -0.380   0.575         
## 
##                Factor1 Factor2 Factor3
## SS loadings      1.761   1.613   0.637
## Proportion Var   0.147   0.134   0.053
## Cumulative Var   0.147   0.281   0.334
## 
## Test of the hypothesis that 3 factors are sufficient.
## The chi square statistic is 113.94 on 33 degrees of freedom.
## The p-value is 7.76e-11
# We make the same comparison as before
# First with scores (does not matter)
# none       -> SS loadings      1.792   1.564   0.622
# regression -> SS loadings      1.792   1.564   0.622
# Bartlett   -> SS loadings      1.792   1.564   0.622
# Then with the rotation (between none and varimax)
# none       -> SS loadings      1.792   1.564   0.622 -> 3.978
# varimax    -> SS loadings      2.001   0.992   0.984 -> 3.977
# promax     -> SS loadings      1.902   0.990   0.888 -> 3.78
# Now number of factors
nFactors::plotnScree(nFactors::nScree(x = eigen(cor(heart_fa3))$values),legend = F, main = "Optimal Amount of Factors (Elbow Method)", xlab = "Number of Factors")

# 4 is the ideal number of factors
fa3_2 = factanal(heart_fa3, factors = 4, rotation= "varimax", 
              scores = "none")
fa3_2
## 
## Call:
## factanal(x = heart_fa3, factors = 4, scores = "none", rotation = "varimax")
## 
## Uniquenesses:
##                 age                 sex          chest_pain rest_blood_pressure 
##               0.428               0.847               0.645               0.872 
##                chol         blood_sugar             restecg      max_heart_rate 
##               0.964               0.896               0.880               0.025 
##               exang             oldpeak               slope               stage 
##               0.630               0.316               0.650               0.455 
## 
## Loadings:
##                     Factor1 Factor2 Factor3 Factor4
## age                          0.156  -0.261   0.687 
## sex                  0.386                         
## chest_pain           0.554  -0.179   0.107         
## rest_blood_pressure          0.143           0.325 
## chol                                         0.185 
## blood_sugar                                  0.319 
## restecg                              0.170   0.298 
## max_heart_rate       0.352  -0.143   0.904  -0.118 
## exang               -0.453   0.363  -0.156         
## oldpeak             -0.240   0.766           0.188 
## slope                0.259  -0.467   0.234         
## stage               -0.585   0.318           0.303 
## 
##                Factor1 Factor2 Factor3 Factor4
## SS loadings      1.264   1.140   1.022   0.966
## Proportion Var   0.105   0.095   0.085   0.080
## Cumulative Var   0.105   0.200   0.285   0.366
## 
## Test of the hypothesis that 4 factors are sufficient.
## The chi square statistic is 59.28 on 24 degrees of freedom.
## The p-value is 8.06e-05
# Results
# varimax -> SS loadings      1.578   1.110   1.034   0.935
# 4.657
# none    -> SS loadings      2.376   0.890   0.771   0.620
# 4.657
# Varimax is basically minimising the variance between factors
# Let's study with varimax because it will make our study easier
# to see
fa3_best = factanal(heart_fa3, factors = 4, rotation= "varimax",
                    scores = "none")
fa3_best
## 
## Call:
## factanal(x = heart_fa3, factors = 4, scores = "none", rotation = "varimax")
## 
## Uniquenesses:
##                 age                 sex          chest_pain rest_blood_pressure 
##               0.428               0.847               0.645               0.872 
##                chol         blood_sugar             restecg      max_heart_rate 
##               0.964               0.896               0.880               0.025 
##               exang             oldpeak               slope               stage 
##               0.630               0.316               0.650               0.455 
## 
## Loadings:
##                     Factor1 Factor2 Factor3 Factor4
## age                          0.156  -0.261   0.687 
## sex                  0.386                         
## chest_pain           0.554  -0.179   0.107         
## rest_blood_pressure          0.143           0.325 
## chol                                         0.185 
## blood_sugar                                  0.319 
## restecg                              0.170   0.298 
## max_heart_rate       0.352  -0.143   0.904  -0.118 
## exang               -0.453   0.363  -0.156         
## oldpeak             -0.240   0.766           0.188 
## slope                0.259  -0.467   0.234         
## stage               -0.585   0.318           0.303 
## 
##                Factor1 Factor2 Factor3 Factor4
## SS loadings      1.264   1.140   1.022   0.966
## Proportion Var   0.105   0.095   0.085   0.080
## Cumulative Var   0.105   0.200   0.285   0.366
## 
## Test of the hypothesis that 4 factors are sufficient.
## The chi square statistic is 59.28 on 24 degrees of freedom.
## The p-value is 8.06e-05

The next step is to visualise and get conclusions.

# Same procedure as before
# Auxiliary data frame to store all the the loadings per factor
factors3 = data.frame(Variable = rownames(fa3_best$loadings),
                      Factor1 = fa3_best$loadings[, 1],
                      Factor2 = fa3_best$loadings[, 2],
                      Factor3 = fa3_best$loadings[, 3],
                      Factor4 = fa3_best$loadings[, 4])
# Now having a column that indicates the factor to which the 
# loading belongs to
loadings_fa3 = tidyr::gather(factors3, Factor, 
                             Loading, -Variable)
# Final barplot with all the Factors and their loadings
gFactors3 = ggplot(data = loadings_fa3, aes(x = Variable, 
                                            y = Loading,
                                            fill = Factor)) +
            geom_bar(stat = "identity") +
            labs(title = "Factor Analysis Loadings",
                 subtitle = "All variables",
                 x = "Variable", 
                 y = "Loadings") + theme_minimal() + 
            theme(axis.text.x = element_text(angle = 45, 
                                             hjust = 1)) +
            facet_wrap(~Factor, scales = "free")
gFactors3

This Factor Analysis procedure confirmed our assumptions that the variable stage would have a lot of weight and make the model less flexible. We see a clear distinctions of the different stages between the factors, being the factors with higher stage importance the unhealthier tendencies and the ones with the lower stage are the healthier tendencies.

4.4 Conclusion FA

In short, thanks to the different Factor Analysis we studied before, we conclude that the stronger model was using all the features, except the stage, making numerical the ones needed.

The FA showed us that there are 4 main tendencies healthy (or usual) young patients, older patients in moderate heart risks, patients with high heart risks and overall healthy patients. These factors can explain in the most precise way the different tendencies of the patients without loosing flexibility while being accurate enough.

5. Clustering

First we make an initial guess of 3 clusters (healthy, disease and confusing/low disease stage)

set.seed(321)

heart_clust = scale(heart_pca) # Since it has all the numeric variables
k = 3
fit_clust = kmeans(heart_clust, centers=k, nstart=1000)
groups = fit_clust$cluster

par(mfrow=c(1,1))
barplot(table(groups), col="blue")

We obtained 3 groups where the largest number of observations show up in the second cluster.

Interpretation of the centers

centers=fit_clust$centers

barplot(centers[1,], las=2, col="darkblue") # High negative max_heart_rate, negative all except age

barplot(centers[2,], las=2, col="darkblue") # High all positive except max_heart_rate (negative)

barplot(centers[3,], las=2, col="darkblue") # High positive max_heart_rate, high negative age

We understand that the clusters obtained are:

Clustplot

# clusplot
fviz_cluster(fit_clust, data = heart_clust, geom = c("point"), ellipse.type = 'norm') +
  theme_minimal()+
  geom_text(label=heart$stage,hjust=0, vjust=0,size=3,check_overlap = F)+
  scale_fill_brewer(palette="Paired")

We can see that the majority of points that belong to the cluster 3 are from the stage 0 (healthy), however, the other 2 groups are pretty messed up. We can try to reduce to 2 clusters:

k = 2
fit_clust = kmeans(heart_clust, centers=k, nstart=1000)
groups = fit_clust$cluster
barplot(table(groups), col="blue")

barplot(centers[1,], las=2, col="darkblue") # Low positive age, remaining low negative except a very negative max_heart_rate

barplot(centers[2,], las=2, col="darkblue") # All features are high except heart_rate

A priori by looking to the centers information, we can assume that the first cluster is related to young people, maybe more healthy aspects for the other variables, and the second cluster for the older people, with higher risk of having a heart disease.

# clusplot
fviz_cluster(fit_clust, data = heart_clust, geom = c("point"), ellipse.type = 'norm') +
  theme_minimal()+
  geom_text(label=heart$stage,hjust=0, vjust=0,size=3,check_overlap = T)+
  scale_fill_brewer(palette="Paired")

We obtained a cluster with many healthy people whereas the other one has all kind of people. We assume that this happens because the center of the “healthy” cluster has a low age, which is strongly correlated with not having the heart disease.

ggplot()+
  aes(x = heart_not_norm$stage, y = heart_not_norm$age)+
  geom_violin(fill = "lightyellow")+
  geom_jitter(aes(color = as.factor(fit_clust$cluster)))+
  theme_minimal() +
  ggtitle("Heart Disease or not by age in each cluster") +
  ylab("Age") +
  xlab("Stage of the disease") +
  labs(color = "Cluster number")

As we go up in the stage of the disease, we can see that less observations belong to the first cluster. We can also observe that the second cluster is mainly composed by older people (the ones who tend to have the disease).

Silhouette Plot

d <- dist(heart_clust, method="euclidean")  
sil = silhouette(groups, d)
plot(sil, col=1:5, main="", border=NA)

summary(sil)
## Silhouette of 854 units in 2 clusters from silhouette.default(x = groups, dist = d) :
##  Cluster sizes and average silhouette widths:
##       447       407 
## 0.2545616 0.1720360 
## Individual silhouette widths:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.06103  0.12366  0.21319  0.21523  0.31111  0.45441
# the same with factoextra
fviz_silhouette(eclust(heart_clust, "kmeans", stand=TRUE, k=2))

##   cluster size ave.sil.width
## 1       1  407          0.17
## 2       2  447          0.25

We can see that the clusters silhouette width with two clusters can be considered well-clustered. This is because there are very few observations near 0 and even less with a negative value (which means that they might be in the wrong cluster). Therefore, we assume that 2 clusters is the ideal amount. However, we can prove it

fviz_nbclust(heart_clust, kmeans, method = 'silhouette', k.max = 15, nstart = 500)
## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

fviz_nbclust(heart_clust, kmeans, method = 'wss', k.max = 15, nstart = 500) +
  geom_vline(xintercept = 6, linetype = 2) #elbow location
## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

fviz_nbclust(heart_clust, kmeans, method = 'gap_stat', k.max = 10, nstart = 100, nboot = 500)
## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

We can see that for the silhouette and the gap_stat methods, the best number of clusters is 2. However, the wss method proposes a higher number of clusters, around 6, so lets study this option.

k = 6
fit_clust = kmeans(heart_clust, centers=k, nstart=1000)
## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations
groups = fit_clust$cluster
barplot(table(groups), col="blue")

centers=fit_clust$centers

barplot(centers[1,], las=2, col="darkblue") 

barplot(centers[2,], las=2, col="darkblue") 

barplot(centers[3,], las=2, col="darkblue") 

barplot(centers[4,], las=2, col="darkblue") 

barplot(centers[5,], las=2, col="darkblue") 

barplot(centers[6,], las=2, col="darkblue") 

# clusplot
fviz_cluster(fit_clust, data = heart_clust, geom = c("point"), ellipse.type = 'norm') +
  theme_minimal()+
  geom_text(label=heart$stage,hjust=0, vjust=0,size=3,check_overlap = T)+
  scale_fill_brewer(palette="Paired")

# the same with factoextra
fviz_silhouette(eclust(heart_clust, "kmeans", stand=TRUE, k=6))

##   cluster size ave.sil.width
## 1       1  124          0.17
## 2       2  178          0.21
## 3       3  134          0.14
## 4       4  174          0.19
## 5       5  106          0.14
## 6       6  138          0.16

We can see that several clusters have negative siluette width and, therefore, it may not be the best number of clusters.

5.1 PAM

Partitioning (clustering) of the data into k clusters *around medoids*

More robust version than k-means, the centers are now patients

fit.pam <- eclust(heart_clust, "pam", stand=TRUE, k=2, graph=F)

fviz_cluster(fit.pam, data = heart_clust, geom = c("point"), pointsize=0.5) +
  theme_minimal() +
  geom_text(label= heart$stage, hjust=0, vjust=0, size=4, check_overlap = T) +
  scale_fill_brewer(palette="Paired")

We can see that the cluster plot is very simmilar to the previous one, where cluster 2 has mostly stage 0 observations (healthy people).

Number of groups by pam:

fviz_nbclust(heart_clust, pam, method = 'silhouette', k.max = 10)

fviz_nbclust(heart_clust, pam, method = 'gap_stat', k.max = 10, nboot = 500)

By both methods we have that 2 clusters would be the best.

# The closer to 1 the more agreement

fit_clust = kmeans(heart_clust, centers=2, nstart=1000)
fit.pam <- eclust(heart_clust, "pam", stand=TRUE, k=2, graph=F)

adjustedRandIndex(fit_clust$cluster, fit.pam$clustering)
## [1] 0.6040579

There is some kind of agreement between them, however it may seem low comparing it to what we expected.

5.2 Kernel k-means

library(kernlab)
## 
## Attaching package: 'kernlab'
## The following object is masked from 'package:mice':
## 
##     convergence
## The following object is masked from 'package:purrr':
## 
##     cross
## The following object is masked from 'package:ggplot2':
## 
##     alpha
fit.ker <- kkmeans(as.matrix(heart_clust), centers=2, kernel="rbfdot") # Radial Basis kernel (Gaussian)
## Using automatic sigma estimation (sigest) for RBF or laplace kernel
# By default, Gaussian kernel is used
# By default, sigma parameter is estimated

centers(fit.ker)
##            [,1]       [,2]       [,3]       [,4]       [,5]
## [1,] -0.6115822 -0.4361056 -0.2499453  0.4850456 -0.5372957
## [2,]  0.5974252  0.4260105  0.2441596 -0.4738177  0.5248583
size(fit.ker)
## [1] 422 432
withinss(fit.ker)
## [1] 2007.595 2580.554
object.ker = list(data = heart_clust, cluster = fit.ker@.Data)
fviz_cluster(object.ker, geom = c("point"), ellipse=F,pointsize=1)+
  theme_minimal()+
  geom_text(label= heart$stage,hjust=0, vjust=0,size=3,check_overlap = T)+
  scale_fill_brewer(palette="Paired")

adjustedRandIndex(fit_clust$cluster, fit.ker)
## [1] 0.8340121
adjustedRandIndex(fit.pam$cluster, fit.ker)
## [1] 0.6298727

We obtained a value of 0.74 for both the relationship with the basic clustering and with PAM. Therefore, we can say that the results obtained are very similar to eachother.

5.3 Hierarchical clustering

# We decide the distance and linkage
d = dist(scale(heart_clust), method = "euclidean")
hc <- hclust(d, method = "ward.D2") 

hc$labels <- heart$heart_disease

fviz_dend(x = hc, 
          k=2,
          palette = "jco", 
          rect = TRUE, rect_fill = TRUE, cex=0.5,
          rect_border = "jco"          
)
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
##   Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

We cannot obtain much information from this plot (We cannot read the labels).

library(igraph)
## 
## Attaching package: 'igraph'
## The following objects are masked from 'package:lubridate':
## 
##     %--%, union
## The following objects are masked from 'package:dplyr':
## 
##     as_data_frame, groups, union
## The following objects are masked from 'package:purrr':
## 
##     compose, simplify
## The following object is masked from 'package:tidyr':
## 
##     crossing
## The following object is masked from 'package:tibble':
## 
##     as_data_frame
## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum
## The following object is masked from 'package:base':
## 
##     union
fviz_dend(x = hc,
          k = 2,
          color_labels_by_k = TRUE,
          cex = 0.8,
          type = "phylogenic",
          repel = TRUE)+  labs(title="Socio-economic-health tree clustering of the world") + theme(axis.text.x=element_blank(),axis.text.y=element_blank())

Not a very useful plot.

5.4 EM clustering

The last clustering technique will be the Expectation-Maximization clustering, which is like k-means but it computes the probabilities that an observation has of belonging to a certain cluster based on probability distributions.

# We will follow the same strategy using the numerical variables
heart_clust_EM = Mclust(scale(heart_clust))
summary(heart_clust_EM)
## ---------------------------------------------------- 
## Gaussian finite mixture model fitted by EM algorithm 
## ---------------------------------------------------- 
## 
## Mclust VEV (ellipsoidal, equal shape) model with 7 components: 
## 
##  log-likelihood   n  df       BIC       ICL
##        -5309.57 854 122 -11442.63 -11767.01
## 
## Clustering table:
##   1   2   3   4   5   6   7 
## 162 107 122 109 101 162  91
# The best in this case is using 8 clusters
head(heart_clust_EM$z)
##           [,1]          [,2]         [,3]          [,4]         [,5]
## 1 5.860221e-01  7.785648e-75 4.139582e-01 1.692244e-106 1.970352e-05
## 2 5.598000e-06  2.667645e-32 3.114209e-03  4.242570e-53 9.968749e-01
## 3 3.494458e-03  5.074990e-96 9.965055e-01 9.981290e-134 2.663456e-08
## 4 1.475420e-03 1.639270e-179 8.589986e-01 2.119396e-238 1.183870e-26
## 5 2.126657e-06  7.205824e-33 5.992572e-01  2.615048e-37 2.937967e-01
## 6 6.448621e-01  2.250531e-13 2.412705e-05  4.209271e-14 3.551137e-01
##            [,6]         [,7]
## 1  4.618307e-74 2.660980e-12
## 2  8.065858e-36 5.275209e-06
## 3  1.431838e-95 9.055168e-15
## 4 3.610659e-175 1.395260e-01
## 5  2.299156e-27 1.069440e-01
## 6  9.674786e-09 4.242689e-24
head(heart_clust_EM$classification)
## 1 2 3 4 5 6 
## 1 5 3 3 3 1
fviz_mclust(object = heart_clust_EM, what = "BIC", 
            pallet = "jco")
## Warning: `gather_()` was deprecated in tidyr 1.2.0.
## ℹ Please use `gather()` instead.
## ℹ The deprecated feature was likely used in the factoextra package.
##   Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Now we get 8 clusters, however, it seems that due to the quantity between some close clusters it will tend in the end to show between 4 and 2 real clusters as we have studied before.

Now, talking more in deep about the values of BIC and ICL we got good results as the lower those values the better.

We will plot our results and see if our assumptions are right.

fviz_mclust(object = heart_clust_EM, what = "classification", 
            geom = "point", pallet = "jco")

Our assumptions was right as we see there is some sort of overlapping between observations along the center of the plot, the right, top left and bottom left.

Therefore if we compare the agreement to the previous clusters, we will not get a number closer to 1 due to the huge difference of number of clusters. (We will not plot a heatmap as we do not have a clear way to analyse results due to the labels).

# Computes the adjusted Rand index comparing two classification
# The closer to 1 the more agreement
adjustedRandIndex(heart_clust_EM$classification, 
                  fit.pam$clustering)
## [1] 0.1191792
# heatmap(scale(heart_clust), scale = "none",
#         distfun = function(x){dist(x, method = "euclidean")},
#        hclustfun = function(x){hclust(x, method = 
#                                "ward.D2")},
#         cexRow = 0.7)  # does not give us clear information

Nevertheless, due to the overlapping we could say that there are between 4 and 2 tendencies of patients relating to their heart’s health, the usual patients are affected by their age, whereas the healthiest are less affected.